Summary. This paper investigates the estimation problem in a regression-type
model. To deal with potentially high dimensions, we provide a procedure
called LOL, for Learning Out of Leaders, with no optimization step. LOL is an
auto-driven algorithm with two thresholding steps. A first adaptive thresholding
selects leaders among the initial regressors in order to obtain a first reduction
of dimensionality. Then a second thresholding is performed on the linear
regression upon the leaders. The consistency of the procedure is investigated.
Exponential bounds are obtained, leading to minimax and adaptive results for a
wide class of sparse parameters, with (quasi) no restriction on the number p of
possible regressors. An extensive computational experiment is conducted to
emphasize the good practical performance of LOL.



1. Introduction 

The general linear model is considered here, with a particular focus on cases
where the number p of regressors is large compared to the number n of
observations (although there is no such restriction). These kinds of models have
today many practical applications in many areas of science and engineering,
including collaborative filtering, machine learning, control, remote sensing, and
computer vision, just to name a few. Examples in statistical signal processing and
nonparametric estimation include the recovery of a continuous-time curve or a
surface from a finite number of noisy samples. Other interesting fields of
application are radiology and biomedical imaging, when fewer measurements about
an image are available compared to the unknown number of pixels.
In biostatistics, high dimensional problems frequently arise, especially in
genomics, when gene expression is studied given a huge number of initial genes
compared to a relatively low number of observations.



A considerable amount of work has been produced in this domain in recent
years, which has been a large source of inspiration for this paper: algorithms
coming from the learning framework (Barron et al. (2008), Binev et al. (2005),
Binev et al. (2007a), Binev et al. (2007b)), as well as the extraordinarily
explosive domain of $\ell_1$ penalties, among many others Tibshirani (1996),
Candes and Tao (2007), Bickel et al. (2008), Bunea et al. (2007a),
Bunea et al. (2007b), Fan and Lv (2008) and Candes and Plan (2009). See also
Lounici (2008) and Alquier and Hebiri (2009).

The essential motivation of this work is to provide one of the simplest
procedures that achieves, at the same time, good performances. The LOL algorithm
(for Learning Out of Leaders) consists of a two-step thresholding procedure. As
there is no optimization step, it is important to address the following question:
what are the domains where the procedure is competitive compared to more
sophisticated algorithms, especially to algorithms performing one or two steps
of $\ell_1$ minimization? One of our aims here is not only to delimit where LOL is
competitive but also to point out where the simplicity of LOL induces a slight
lack of efficiency, from both a theoretical and a practical point of view.

Let us start by introducing the ideas behind the LOL algorithm.
This simple procedure can be viewed as an 'explanation' or as a 'cartoon' of
$\ell_1$ minimization. It is well known that when the regressors are normalized
and orthogonal, $\ell_1$ minimization corresponds to soft thresholding, which
itself is close to hard thresholding. Hence, it is quite natural to expect that
thresholding should perform well, at least in cases not too far from these
orthonormal conditions. This corresponds, as specified below, to small coherence
conditions. A tricky problem occurs when the regressors are not orthonormal or
when the number of regressors is large. Then, the least squares problem has a
non-unique solution and the solutions are very unstable. This is the heart and
the main difficulty for the $\ell_1$ minimizers or, more generally, for all
methods based on sparsity assumptions. In order to be solved, the problem
essentially requires, as will be discussed extensively in the sequel, two types
of conditions: sparsity of the solution and isometry properties for the matrix of
regressors. This is often the part where the computational cost of the algorithms
shows up. Obviously a simple thresholding would not fit, but the above-mentioned
conditions can ensure that it is at least possible to select some regressors and
exclude some others. The LOL algorithm solves the difficult problem of the choice
of the regressors in a quite crude way, by adaptively selecting the N regressors
which are the most correlated to the target: this defines the first thresholding
step of LOL, determining the N leaders. The number N is chosen using a fine
tuning parameter depending on the coherence. It has to be emphasized that the
choice is auto-driven in the algorithm. In a second thresholding step, LOL
regresses on the leaders, then thresholds the estimated coefficients taking into
account the noise of the model.

The properties of LOL are investigated here specifically for the prediction
problem. More precisely, it is established in this paper that LOL has a prediction
error which goes to zero in probability with exponential rates. These types
of results are often called Bahadur-type efficiency. Although Bahadur efficiency
of test and estimation procedures goes back to the sixties (see Bahadur (1960)),
it has recently seen a revival in learning theory, where the rates of convergence
(preferably exponential) of the procedures are investigated and compared to
optimality. It is also related to a common concept in learning theory: the
Probably Approximately Correct (PAC) learning paradigm introduced in Valiant (1984).

Of course, because of the straightforwardness of the method, some loss of
efficiency is expected compared to more elaborate and costly procedures. But
even with a loss, the limitations of the procedure can bring interesting
information on the $\ell_1$ minimizers themselves. From both a theoretical and a
practical point of view, with small coherence, the LOL procedure appears to be as
powerful as the best known procedures. The exponential rates of convergence
match, for instance, the lower bounds obtained in Raskutti et al. (2009). This
result is obtained under minimal conditions on the number p of potential
regressors. Also, even with a loss in the rate, a positive aspect is that the
practitioner is informed of the possible instability of the method, since the
coherence can be computed from the observations before any calculation. This is
notably not the case for usual conditions such as RIP or even more abstract ones,
which are impossible to verify in practice. An intensive computational study is
performed to show the advantages and limitations of the LOL procedure in several
practical aspects. The case where the regressors form a random design matrix with
i.i.d. entries is investigated in Section 6. Different laws of the inputs are
studied (Gaussian, Uniform, Bernoulli or Student laws), each inducing a specific
coherence for the design matrix. Several interesting features are also discussed
in this section. Dependent inputs are simulated and an application with real data
is also discussed. The impact of the sparsity and of the indeterminacy of the
regression on the performances of LOL is studied. A comparison with two other
two-step procedures, namely Fan and Lv (2008) and Candes and Plan (2009), is also
performed. The most interesting conclusion is that the practical results are even
better and more comforting than the theoretical ones, in the sense that LOL shows
good performances even when the coherence is pretty high.

To summarize this presentation and answer the question "In what type
of situations should a practitioner prefer to use LOL rather than other available
methods?", our results show that when the number of regressors p is very large,
and when the computational aspects of optimization procedures become difficult
and the theoretical results uncertain, LOL should be preferred by a practitioner
after ensuring (and this is done by a simple calculation) that the coherence is
not too high. On the other hand, when the coherence is very high, one should
probably be suspicious of any method.

The paper is organized as follows. In Section 2, the general model and the
notations are presented. In Section 3, LOL is detailed, as well as other
procedures with an $\ell_1$ optimization step; comparisons with other procedures
are discussed later in Section 5. In Section 4, after stating the hypotheses
needed on the model, theoretical results are established. The practical
performances of LOL are investigated in Section 6 and the proofs are detailed in
Section 7.

2. Model and coherence



2.1. General model

In this paper, we observe a pair $(Y, \Phi) \in \mathbb{R}^n \times \mathbb{R}^{n \times p}$
where $\Phi$ is the design matrix and $Y$ a vector of response variables. These two
quantities are linked by the standard linear model

$$ Y = \Phi\alpha + u + \varepsilon \tag{1} $$

where the parameter $\alpha \in \mathbb{R}^p$ is the unknown vector to be estimated and

• the vector $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^t$ is a (non observed) vector of random
errors. Its components are assumed to be independent Gaussian variables
$N(0, \sigma^2)$, but essentially comparable results can be obtained in the case of
zero mean subgaussian errors (see the remark before Lemma 3).

• the vector $u = (u_1, \dots, u_n)^t$ is a non observed vector of (possibly) random
errors. Its amplitude is assumed to be small. The difference between the
two previously described "errors" lies in the fact that the $\varepsilon_i$'s are centered
but unbounded and independent, while the $u_i$'s are only bounded. The
necessity of introducing these two types of errors becomes clear in the
functional regression example.

• $\Phi$ is a known $n \times p$ matrix. This paper focuses on the interesting case
where $p \gg n$, but this is not necessary. We assume that $\Phi$ has normalized
columns (or we normalize them) in the following sense:

$$ \frac{1}{n}\sum_{i=1}^{n} \Phi_{i\ell}^2 = 1, \quad \forall\, \ell = 1, \dots, p. \tag{2} $$
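The normalization of the columns of the design matrix can be sketched in a few lines. This is only an illustration: the function name `normalize_columns` and the Gaussian test matrix are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def normalize_columns(Phi):
    """Rescale each column l so that (1/n) * sum_i Phi[i, l]**2 == 1."""
    Phi = np.asarray(Phi, dtype=float)
    n = Phi.shape[0]
    # Empirical quadratic norm of each column.
    scale = np.sqrt((Phi ** 2).sum(axis=0) / n)
    if np.any(scale == 0):
        raise ValueError("a zero column cannot be normalized")
    return Phi / scale

# Hypothetical example: n = 50 observations, p = 200 regressors (p > n).
rng = np.random.default_rng(0)
Phi = normalize_columns(rng.normal(size=(50, 200)))
```

With this convention every diagonal entry of the Gram matrix built from the normalized columns (divided by n) equals 1, so only its off-diagonal entries carry information about the design.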

2.2. Examples 

An example of such a model occurs when the matrix $\Phi$ is a random matrix
composed of n independent and mainly identically distributed random vectors
of size p. The simulation study given in Section 6 details the important role
played by the distribution of these random vectors.

A second application is the learning (also called functional regression) model

$$ Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \dots, n \tag{3} $$

where $f$ is the parameter of interest. This model is classically related to the
previous one using a dictionary $\mathcal{D} = \{g_\ell,\ \ell \le p\}$ of size p, $\Phi$ then
becoming the matrix with general term $\Phi_{i\ell} = g_\ell(X_i)$. Assuming that $f$ can be
reasonably well approximated using the elements of the dictionary means that
$f$ can be written as $f = \sum_{\ell=1}^{p} \alpha_\ell g_\ell + h$ where $h$ is hopefully small. It
becomes clear here that $u_i = h(X_i)$. This case has been investigated in more
detail in Kerkyacharian, Mougeot, Picard, and Tribouley (Kerkyacharian et al.)
using an earlier and less elaborate version of LOL.

2.3. Coherence

In the sequel, the following notations are used. Let m be an integer and q > 0;
for any $x \in \mathbb{R}^m$,

$$ \|x\|_{\ell_q(m)} := \Bigl(\sum_{k=1}^{m} |x_k|^q\Bigr)^{1/q} $$

denotes the $\ell_q(\mathbb{R}^m)$-norm (or quasi norm) and, for any $x \in \mathbb{R}^n$,

$$ \|x\|_n^2 := \frac{1}{n}\|x\|_{\ell_2(n)}^2 $$

denotes the quadratic empirical norm. We define the following $p \times p$ Gram
matrix as

$$ M := \frac{1}{n}\Phi^t\Phi. $$

The quantity

$$ \tau_n = \sup_{\ell \ne m} |M_{\ell m}| = \sup_{\ell \ne m} \Bigl|\frac{1}{n}\sum_{i=1}^{n} \Phi_{i\ell}\Phi_{im}\Bigr| $$

is called the coherence of the matrix $M$. Observe that $\tau_n$ is a quantity directly
computable from the data. It is also a crucial quantity because it induces a
bound on the size of the invertible matrices built with the columns of $M$. More
precisely, fix $0 < \nu < 1$ and let $\mathcal{I}$ be a subset of indices of $\{1, \dots, p\}$ with
cardinality m. Denote by $\Phi_{|\mathcal{I}}$ the matrix restricted to the columns of $\Phi$ whose
indices belong to $\mathcal{I}$. If $2\tau_n < \nu$, the associated Gram matrix

$$ M(\mathcal{I}) := \frac{1}{n}\Phi_{|\mathcal{I}}^t\Phi_{|\mathcal{I}} $$

is almost diagonal as soon as m is smaller than $N := \lfloor \nu/\tau_n \rfloor$ (where $\lfloor \nu/\tau_n \rfloor$
denotes the integer part of $\nu/\tau_n$), in the sense that it satisfies the following
so-called Restricted Isometry Property (RIP)

$$ \forall x \in \mathbb{R}^m, \quad \|x\|_{\ell_2(m)}^2 (1 - \nu) \le x^t M(\mathcal{I})\, x \le \|x\|_{\ell_2(m)}^2 (1 + \nu). \tag{4} $$

This proves in particular that the matrix $M(\mathcal{I})$ is invertible. The proof of (4) is
simple and can be found, together with a discussion on the relations between the
RIP property and conditions on the coherence, for instance in Blanchard et al. (2009).
The RIP property (4) can be rewritten as follows. The following lemma is a key
ingredient of our proofs.

Lemma 1. Let $\mathcal{I}$ be a subset of $\{1, \dots, p\}$ satisfying $\#(\mathcal{I}) \le N$. For any
$x \in \mathbb{R}^{\#(\mathcal{I})}$, we get

$$ (1 - \nu)\,\|x\|_{\ell_2(\#(\mathcal{I}))}^2 \le \Bigl\|\sum_{\ell \in \mathcal{I}} x_\ell \Phi_{\cdot\ell}\Bigr\|_n^2 \le (1 + \nu)\,\|x\|_{\ell_2(\#(\mathcal{I}))}^2. $$
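Since the coherence is directly computable from the data, the quantities of this section can be checked numerically. The sketch below computes the coherence and the bound N on the number of leaders, then inspects the eigenvalues of a restricted Gram matrix built from N randomly chosen columns. The helper names (`coherence`, `max_leaders`) and the Gaussian design are assumptions made for illustration only.

```python
import numpy as np

def coherence(Phi):
    """Largest absolute off-diagonal entry of the Gram matrix (1/n) Phi^t Phi."""
    n = Phi.shape[0]
    M = Phi.T @ Phi / n
    off_diagonal = np.abs(M - np.diag(np.diag(M)))
    return off_diagonal.max()

def max_leaders(Phi, nu=0.5):
    """Integer part of nu / tau_n: upper bound on the number of leaders."""
    return int(np.floor(nu / coherence(Phi)))

# Hypothetical Gaussian design with normalized columns.
rng = np.random.default_rng(1)
n, p, nu = 400, 50, 0.5
Phi = rng.normal(size=(n, p))
Phi /= np.sqrt((Phi ** 2).mean(axis=0))

tau_n = coherence(Phi)
N = max_leaders(Phi, nu)

# Restricted Gram matrix on at most N columns: its spectrum should lie
# in [1 - nu, 1 + nu], which is the RIP property (4).
idx = rng.choice(p, size=min(N, p), replace=False)
eig = np.linalg.eigvalsh(Phi[:, idx].T @ Phi[:, idx] / n)
```

The eigenvalue check is a direct restatement of (4): the extreme eigenvalues of the restricted Gram matrix are the extreme values of the quadratic form over unit vectors.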

3. Estimation procedures

In this section, the estimation of the unknown parameter $\alpha$ using LOL is
described first. Next, a short review of the procedures directly connected to LOL
is proposed.

Once and for all, the constant $\nu$ is fixed. This constant is obviously related to
the precision of the LOL procedure: the default value considered here is $\nu = 0.5$.



3.1. LOL Procedure

As inputs, the LOL algorithm requires 4 pieces of information:

• The observed variable $Y$ and the regression variables $\Phi = (\Phi_{\cdot 1}, \dots, \Phi_{\cdot p})$.

• The tuning parameters $\lambda_n(1)$ and $\lambda_n(2)$ giving the level of the thresholds.

Observe that the algorithm is adaptive in the sense that no information on the
sparsity of the sequence $\alpha$ is necessary. An upper bound for the cardinal of the
set of leaders is first computed: $N \leftarrow \lfloor \nu/\tau_n \rfloor$. Thereafter, LOL performs two
major steps:

• Find the leaders by thresholding the "correlations" between $Y$ and the $\Phi_{\cdot\ell}$'s
at level $\lambda_n(1)$. $\mathcal{B}$ denotes the set of indices of the selected leaders. The
size of this set is bounded by N by retaining only (when necessary) the
indices with maximal correlations.

• Regress $Y$ on the leaders $\Phi_{|\mathcal{B}} = (\Phi_{\cdot\ell})_{\ell \in \mathcal{B}}$ and threshold the result at level
$\lambda_n(2)$.

The following pseudocode gives details of the procedure. Note that there is no
optimization step and no iterative procedure.



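LOL($\Phi$, $Y$, $\lambda_n(1)$, $\lambda_n(2)$)

Input: observed data $Y$, regression variables $\Phi$, tuning parameters $\lambda_n(1)$, $\lambda_n(2)$
Output: estimated parameters $\hat\alpha^*$, and predicted value $\hat Y$

STEP 0    {Initialize}
    $\nu = 0.5$
    $\tau_n \leftarrow n^{-1} \max_{\ell \ne m} |\sum_{i=1}^{n} \Phi_{i\ell}\Phi_{im}|$    {Compute the coherence}
    $N \leftarrow \lfloor \nu/\tau_n \rfloor$    {Compute the upper bound for the cardinal of the leaders set}

STEP 1    {Find the leaders}
    For $\ell = 1 : p$
        $\tilde\alpha_\ell \leftarrow n^{-1} \sum_{i=1}^{n} \Phi_{i\ell} Y_i$    {Compute the 'correlations' between the observations and the regressors}
        $\tilde\alpha^*_\ell \leftarrow \tilde\alpha_\ell\, I\{|\tilde\alpha_\ell| \ge \lambda_n(1)\}$    {Threshold}
    End(for)
    $\mathcal{B} \leftarrow \{\ell,\ \tilde\alpha^*_\ell \ne 0\}$    {Determine the leaders}
    If $\#\mathcal{B} > N$
        indices $\leftarrow$ sort($|\tilde\alpha^*|$)    {Sort the 'correlations' of the candidates}
        $\mathcal{B} \leftarrow$ indices[1 : N]    {Take the indices associated to the N largest}
    End(if)

STEP 2    {Regress on the leaders}
    $\hat\alpha_{|\mathcal{B}} \leftarrow (\Phi_{|\mathcal{B}}^t \Phi_{|\mathcal{B}})^{-1} \Phi_{|\mathcal{B}}^t Y$    {Least square estimators}
    $\hat\alpha^*_{|\mathcal{B}} \leftarrow \hat\alpha_{|\mathcal{B}}\, I\{|\hat\alpha_{|\mathcal{B}}| \ge \lambda_n(2)\}$    {Threshold}
    $\hat Y \leftarrow \Phi\hat\alpha^*$    {Find the predicted value}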
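The pseudocode can be turned into a short runnable sketch. The version below is an illustration and not the authors' implementation: the function name `lol`, the use of `numpy.linalg.lstsq` in place of the explicit inverse, the non-strict threshold comparisons and the handling of a zero coherence are choices made here. The thresholds are passed in by the caller; how to tune them is a separate question.

```python
import numpy as np

def lol(Phi, Y, lam1, lam2, nu=0.5):
    """Sketch of LOL: two thresholding steps, no optimization step.

    Assumes the columns of Phi satisfy (1/n) * sum_i Phi[i, l]**2 == 1.
    Returns the thresholded coefficients and the predicted values.
    """
    n, p = Phi.shape

    # STEP 0: coherence and upper bound N on the number of leaders.
    M = Phi.T @ Phi / n
    np.fill_diagonal(M, 0.0)
    tau_n = np.abs(M).max()
    N = int(np.floor(nu / tau_n)) if tau_n > 0 else p  # assumption: no cap if tau_n == 0

    # STEP 1: threshold the 'correlations' between Y and the regressors.
    corr = Phi.T @ Y / n
    leaders = np.flatnonzero(np.abs(corr) >= lam1)
    if leaders.size > N:
        # Keep the N candidates with the largest absolute correlations.
        order = np.argsort(-np.abs(corr[leaders]))
        leaders = leaders[order[:N]]

    # STEP 2: least squares on the leaders, then threshold the coefficients.
    alpha = np.zeros(p)
    if leaders.size > 0:
        coef, *_ = np.linalg.lstsq(Phi[:, leaders], Y, rcond=None)
        coef[np.abs(coef) < lam2] = 0.0
        alpha[leaders] = coef
    return alpha, Phi @ alpha

# Sanity check on an orthogonal, noise-free design (hypothetical example):
# here the correlations equal the true coefficients, so LOL reduces to
# hard thresholding and recovers the parameter exactly.
rng = np.random.default_rng(0)
n, p = 100, 20
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
Phi = np.sqrt(n) * Q
alpha_true = np.zeros(p)
alpha_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
Y = Phi @ alpha_true
alpha_hat, Y_hat = lol(Phi, Y, lam1=0.5, lam2=0.5)
```

Solving the restricted regression with a least squares routine rather than forming the inverse of the restricted Gram matrix is numerically safer, and it gives the same estimator whenever that matrix is invertible.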



3.2. Several inspirations 

Although it is impossible to be exhaustive in such a productive domain, some of 
the works directly in relation to our construction are hereafter mentioned. We 
apologize in advance for all the works that are not mentioned but still remain in 
connection. For a comprehensive overview, we refer to Fan and Lv (2010). 

Several authors propose procedures to solve the selection problem or the
estimation problem in cases where the vector $\alpha$ has only a small number of non
zero components, and (often) when the design matrix $\Phi$ is composed of i.i.d.
random vectors: see among many others Tibshirani (1996), Candes and Tao (2007),
Bickel et al. (2008), Bunea et al. (2007a) and Bunea et al. (2007b).

A particular focus is placed here on the two-step procedures, which are also
commonly used, and apparently have been for a long time, since such a procedure
is already discussed in 1959 (see Satterthwaite (1959)). In Candes and Tao (2007)
and Candes and Plan (2009), the leaders are selected with the Dantzig procedure
and the Lasso procedure respectively. Then, the estimated coefficients are
computed using a linear regression on the leaders. Using an intensive simulation
program, Fan and Lv (2008) show that it could be unfavorable to use the Lasso or
Dantzig procedures before the reduction of the dimension. They also provide
a search among leaders called the Sure Independence Screening (SIS) procedure.
This procedure is similar to the one discussed in this paper: the leaders are the
$N = \lfloor \gamma_n n \rfloor$ columns of $\Phi$ with largest correlations to the target variable $Y$
($\gamma_n$ is a tuning sequence tending to zero). This step is followed by a subsequent
estimation procedure using Dantzig or Lasso. All these methods show a higher
complexity compared to LOL.

The LOL procedure can also be connected to the family of Orthogonal Matching
Pursuit algorithms, as well as to the greedy algorithms in general. For
this interesting literature, we refer among others to Needell and Tropp (2009),
Tropp and Gilbert (2007), Barron et al. (2008). The main advantage of LOL
compared to this kind of algorithm is that there is no iterative search for the
leaders. All the leaders are selected in one shot and the procedure stops just
after the second step. Moreover, the convergence results are almost as good as
those of these procedures in many situations.

4. Main theoretical results 

This section states the theoretical results for the LOL procedure. The measures
of performance used in the theorems are first presented, then the assumptions on
the set of parameters $\alpha$ are given.

4.1. Loss function

Let us define the following loss function to measure the difference between the
true value $\alpha \in \mathbb{R}^p$ and the result $\hat\alpha^*$ computed by LOL. Denote by $\Phi_{i\cdot}$ the i-th
line of the matrix $\Phi$ and recall that the i-th observation is given by the model:

$$ Y_i = \Phi_{i\cdot}\alpha + u_i + \varepsilon_i. $$

The predicted i-th observation is $\hat Y_i = \Phi_{i\cdot}\hat\alpha^*$. The criterion of performance is
defined by the empirical quadratic distance between the predicted variables and
their expected values:

$$ d(\hat\alpha^*, \alpha)^2 = \frac{1}{n}\sum_{i=1}^{n} \bigl(\hat Y_i - E Y_i\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n} \Bigl(\sum_{\ell=1}^{p} (\hat\alpha^*_\ell - \alpha_\ell)\Phi_{i\ell} - u_i\Bigr)^2 $$

which can be rewritten using the empirical norm

$$ d(\hat\alpha^*, \alpha) := \Bigl\|\sum_{\ell=1}^{p} (\hat\alpha^*_\ell - \alpha_\ell)\Phi_{\cdot\ell} - u\Bigr\|_n. $$

Observe that the considered loss is the usual error of prediction

$$ d(\hat\alpha^*, \alpha) = \|\Phi(\hat\alpha^* - \alpha)\|_n $$

when the 'errors' $u_i$'s are all zero.
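The loss can be computed directly once the true parameter is known, as is the case in a simulation. The sketch below measures the empirical distance between the predictions and the expected observations; the function name `prediction_loss` and the toy inputs are assumptions made for illustration, and the vector u is treated as given (non random).

```python
import numpy as np

def prediction_loss(Phi, alpha_hat, alpha, u=None):
    """Empirical norm of Y_hat - E[Y], with E[Y] = Phi @ alpha + u."""
    n = Phi.shape[0]
    diff = Phi @ (np.asarray(alpha_hat) - np.asarray(alpha))
    if u is not None:
        diff = diff - np.asarray(u)
    return np.sqrt(np.sum(diff ** 2) / n)

# Toy check with a 2 x 2 design whose columns satisfy the normalization (2).
Phi = np.sqrt(2.0) * np.eye(2)
loss_no_u = prediction_loss(Phi, [1.0, 0.0], [0.0, 0.0])
loss_with_u = prediction_loss(Phi, [1.0, 0.0], [0.0, 0.0], u=[np.sqrt(2.0), 0.0])
```

When `u` is omitted the function returns the usual prediction error, which is the case considered in most of the simulations.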

4.2. Bahadur-type efficiency

Our measure of performance stems from the Bahadur efficiency of test and
estimation procedures and is defined for any tolerance $\eta > 0$ as

$$ AC_n(\mathrm{LOL}, \eta, \alpha) = P\bigl(d(\hat\alpha^*, \alpha) > \eta\bigr). $$

This quantity measures the confidence that the estimator is accurate to the
tolerance $\eta$ if the true point is $\alpha$. We also define and consider the uniform
confidence over a class $\Theta$:

$$ AC_n(\mathrm{LOL}, \eta, \Theta) = \sup_{\alpha \in \Theta} P\bigl(d(\hat\alpha^*, \alpha) > \eta\bigr). \tag{6} $$

This quantity has been studied for instance in DeVore et al. (2006) in the
learning framework. In most examples, it is proved that there exist a phase
transition and a critical value $\eta_n$ depending on n such that the uniform
confidence $AC_n(\cdot, \eta, \Theta)$ decreases exponentially for any $\eta > \eta_n$. More precisely,
in terms of lower bounds (but similar bounds are also valid in terms of upper
bounds), it is proved in DeVore et al. (2006) that



where $N(\Theta, \eta)$ is the tight entropy analogue of the Sobolev covering numbers.
In this expression, $\eta_n$ appears quite convincingly as a turning point after which
the exponential term dominates the entropy term. Observe that the critical
value $\eta_n$ is essential since it yields bounds for $\sup_{\alpha \in \Theta} E_\alpha\, d(\hat\alpha^*, \alpha)$, which is
another (more standard) measure of performance of the procedure. The results in
DeVore et al. (2006) are obtained in the learning framework; however, identical
bounds can easily be expected in the setting (1) of this paper, as for the results
obtained in Raskutti et al. (2009). This is discussed in more detail in the sequel
since similar bounds are obtained for LOL with the sparsity constraints defined
below.

4.3. Performances of the procedure LOL. $\ell_q$ ball constraints

In this part, we consider the following sparsity constraint

$$ \text{for } q \in (0, 1], \quad B_q(M) := \bigl\{\alpha \in \mathbb{R}^p,\ \|\alpha\|_{\ell_q(p)} \le M\bigr\} $$

or

$$ \text{for } q = 0, \quad B_0(S, M) := \Bigl\{\alpha \in \mathbb{R}^p,\ \sum_{\ell=1}^{p} I\{\alpha_\ell \ne 0\} \le S,\ \|\alpha\|_{\ell_1(p)} \le M\Bigr\}. $$
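
Theorem 1. Let $M > 0$ and fix $\nu$ in $]0, 1[$. Assume that there exists a positive
constant $c'$ such that $p \le \exp(c'n)$. Suppose there exist positive constants
$c$, $c_0$ such that

$$ \tau_n \le c\sqrt{\frac{\log p}{n}} \quad \text{and} \quad \sup_{i=1,\dots,n} |u_i| \le c_0\sqrt{\frac{1}{n}}. $$

Let us choose the thresholds $\lambda_n(1)$ and $\lambda_n(2)$ such that

$$ \lambda_n(2) = T_3\sqrt{\frac{\log p}{n}} \quad \text{and} \quad \lambda_n(1) = T_4\sqrt{\frac{\log p}{n}} \tag{8} $$

for $T_4 \ge T_1 \vee T_2 c \vee T_3$, where

$$ T_1 = 64\sigma \vee 1 \vee \frac{1}{(1-\nu)\sigma} \quad \text{and} \quad T_2 = 6M \vee \frac{4M + 3c_0}{12\sigma}. $$

Then, there exist positive constants $D$ and $\gamma$ depending on $\nu$, $c$, $c'$, $c_0$, $T_3$, $T_4$ such
that

$$ \sup_{\alpha \in B_q(M)} P\bigl(d(\hat\alpha^*, \alpha) > \eta\bigr) \le \begin{cases} 4e^{-\gamma n \eta^2} & \text{for } \eta^2 \ge D\left(\frac{\log p}{n}\right)^{1-q/2}, \\ 1 & \text{for } \eta^2 \le D\left(\frac{\log p}{n}\right)^{1-q/2}, \end{cases} $$

$$ \sup_{\alpha \in B_0(S,M)} P\bigl(d(\hat\alpha^*, \alpha) > \eta\bigr) \le \begin{cases} 4e^{-\gamma n \eta^2} & \text{for } \eta^2 \ge D\,\frac{S\log p}{n}, \\ 1 & \text{for } \eta^2 \le D\,\frac{S\log p}{n}, \end{cases} $$

for any $S < \nu/\tau_n$.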

We immediately deduce the following bound for the usual expected error.

Corollary 1. For $r \ge 1$ arbitrary, under the same assumptions as in
Theorem 1, we have

$$ \sup_{B_q(M)} E\, d(\hat\alpha^*, \alpha)^r \le D' \left(\frac{\log p}{n}\right)^{(1-q/2)r/2} $$

for some positive constant $D'$ depending on $\nu$, $c$, $c'$, $c_0$, $T_3$, $T_4$, as well as

$$ \sup_{B_0(S,M)} E\, d(\hat\alpha^*, \alpha)^r \le D' \left(\frac{S\log p}{n}\right)^{r/2} $$

for any $S < \nu/\tau_n$.
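The step from an exponential tail bound to a moment bound can be sketched with a standard tail-integration argument. This is not taken from the authors' proofs (which are deferred to Section 7); it is written for the $B_q(M)$ case only, and the generic prefactor $C$, the shorthand $\eta_n^2 = D(\log p/n)^{1-q/2}$ and the extra requirement $\log p \ge 1$ are assumptions made for this illustration.

```latex
% Assumption: P(d > \eta) \le C e^{-\gamma n \eta^2} for \eta \ge \eta_n,
% and the trivial bound P(d > \eta) \le 1 for \eta < \eta_n.
\begin{align*}
E\, d(\hat\alpha^*,\alpha)^r
  &= \int_0^\infty r\,\eta^{r-1}\, P\bigl(d(\hat\alpha^*,\alpha) > \eta\bigr)\, d\eta \\
  &\le \int_0^{\eta_n} r\,\eta^{r-1}\, d\eta
     + C \int_{\eta_n}^\infty r\,\eta^{r-1} e^{-\gamma n \eta^2}\, d\eta \\
  &\le \eta_n^r + C\,\frac{r}{2}\,\Gamma\!\left(\frac{r}{2}\right) (\gamma n)^{-r/2}.
\end{align*}
% Since \log p \ge 1 and 0 < q \le 1 give n^{-1} \le (\log p / n)^{1-q/2},
% both terms are at most a constant times (\log p / n)^{(1-q/2) r/2}.
```

The first term is the contribution of the region below the critical value, where only the trivial bound is available, which is why the critical value $\eta_n$ alone dictates the rate of the expected error.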

4.4. Performances of LOL procedure. MaxiSet point of view

In this section we develop a slightly different point of view stemming from the
maxiset theory (see for instance Kerkyacharian and Picard (2000)). More
precisely, our aim is to evaluate the quality of our algorithm when the coherence
and the thresholding tuning constants are given (fixed). In particular, this means
that we do not assume in this part that the coherence satisfies
$\tau_n \le O(\sqrt{\log p/n})$. To this end, we consider a set $V(S, M)$ of parameters $\alpha$
depending on these constants $M, S > 0$ and prove that the right exponential decay
of the confidence is achieved on this set. The phase transition $\eta_n$ depends on
the tuning constants and on the coherence in the following way

$$ \eta_n^2 = O\left(\frac{S\log p}{n} \vee S\tau_n^2\right). $$

Observe that we do not prove that the set $V(S, M)$ is exactly the maxiset of
the method (considered in terms of $\eta_n$), since we do not prove that it is the
largest set with the phase transition $\eta_n$. However, the following theorem reflects
quite extensively the theoretical behavior of LOL, even in case of deterioration
due to a high coherence or a bad choice of the thresholds. Notice also that
Theorem 1 is a quite easy consequence of Theorem 2.

Let us now define the set $V(S, M)$ by the following sparsity constraints. There
exist $S \le \lfloor \nu/\tau_n \rfloor$ and constants $M$, $c_0$, $c_1$, $c_2$ such that the vector $\alpha \in \mathbb{R}^p$
satisfies the following conditions

$$ \|\alpha\|_{\ell_1(p)} \le M, \tag{9} $$

$$ \#\bigl\{\ell \in \{1, \dots, p\},\ |\alpha_\ell| \ge \lambda_n(2)/2\bigr\} \le S, \tag{10} $$

$$ \sum_{(\ell) > N} |\alpha_{(\ell)}| \le c_1\left(\frac{S\log p}{n}\right)^{1/2}, \tag{11} $$

$$ \sum_{\ell=1}^{p} |\alpha_\ell|^2\, I\{|\alpha_\ell| \le 2\lambda_n(1)\} \le c_2\,\frac{S\log p}{n}. \tag{12} $$

Recall that $(\alpha_{(\ell)})$ is the ordered sequence (for the modulus)
$|\alpha_{(1)}| \ge |\alpha_{(2)}| \ge \dots \ge |\alpha_{(p)}|$. For $S, M > 0$, $V(S, M)$ denotes the class of models
of type (1) satisfying the sparsity conditions (9), (10), (11), (12). Note that we
emphasize in the notation of the set $V(S, M)$ the constants $S$ and $M$, although the
set depends on other additional constants, since these two constants play a
crucial role.

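Theorem 2. Let $S, M > 0$ and fix $\nu$ in $]0, 1[$. The thresholds $\lambda_n(1)$ and
$\lambda_n(2)$ are chosen such that

$$ \lambda_n(1) \ge \left(T_1\sqrt{\frac{\log p}{n}} \vee T_2\,\tau_n\right) \quad \text{and} \quad \lambda_n(2) \le \lambda_n(1) $$

where

$$ T_1 = \left(64\sigma \vee 1 \vee \frac{1}{(1-\nu)\sigma}\right) \quad \text{and} \quad T_2 = \left(6M \vee \frac{4M + 3c_0}{12\sigma}\right). $$

Then, if in addition we have

$$ \sup_{i=1,\dots,n} |u_i| \le c_0\left(\frac{S}{n}\right)^{1/2} \tag{13} $$

there exist positive constants $D$ and $\gamma$ depending on $\nu$, $\sigma^2$, $M$, $c_0$, $c_1$, $c_2$ such that

$$ \sup_{\alpha \in V(S,M)} P\bigl(d(\hat\alpha^*, \alpha) > \eta\bigr) \le \begin{cases} 4e^{-\gamma n \eta^2} & \text{for } \eta^2 \ge D\left(\frac{S\log p}{n} \vee S\tau_n^2\right), \\ 1 & \text{for } \eta^2 \le D\left(\frac{S\log p}{n} \vee S\tau_n^2\right). \end{cases} \tag{14} $$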



For the sake of completeness, the constants $D$ and $\gamma$ are given precisely at the
end of the proof of Theorem 2. However, it is obvious that the constants provided
here are not optimal: for instance in the proof, in order to avoid unnecessary
technicalities, most of the events are divided as if they had equal importance,
leading to constants which are each time divided by 2. Obviously there is some
room for improvement at each of these stages.

An elementary consequence of Theorem 2 is the following corollary, which
details the behavior of the expectation of $d(\hat\alpha^*, \alpha)$. Notice also that we did not
give explicit oracle inequalities here, although they could be derived from the
proof of Theorem 2.

Corollary 2. For $r \ge 1$ arbitrary, under the same assumptions as in
Theorem 2, we get

$$ \sup_{V(S,M)} E\, d(\hat\alpha^*, \alpha)^r \le D' \left(\frac{S\log p}{n} \vee S\tau_n^2\right)^{r/2} $$

for some positive constant $D'$ depending on $\nu$, $\sigma^2$, $M$, $c_0$, $c_1$, $c_2$ and $r$.



5. Remarks and Comparisons 

5.1. Results under $\ell_q$ constraints

It is important to discuss the relations of the results in Theorem 1 with
Raskutti et al. (2009), which provides minimax bounds in a setting close to ours.
Their results basically concern exponential inequalities (as ours do), but they
are only interested in the case $\eta = \eta_n$, for which they prove upper and lower
bounds. If we compare our results to theirs, we find that LOL is exactly minimax
for any q in (0, 1], with even a better precision since we prove the exponential
inequality for any $\eta$. In the case q = 0, we have a slight logarithmic loss. Notice
that we also need a bound on $\|\alpha\|_{\ell_1}$. We do not know if this is due to our proof or
specific to the method.

5.2. Ultra high dimensions

One main advantage of LOL is that it is really designed for very large dimensions.
As seen in the results of Theorem 1 and Theorem 2, no limitation on p is required
except $p \le \exp(c'n)$ (in fact this is only needed in Theorem 1). Notice that, if
this condition is not satisfied, no algorithm is convergent, as proved by the
lower bound of Raskutti et al. (2009). Moreover, the fact that the algorithm has
no optimization step is a serious advantage when p becomes large.

5.3. Adaptation 

Our theoretical results are provided under conditions on the tuning quantities
$\lambda_n(1)$ and $\lambda_n(2)$. The default values stemming from the theoretical results are
the following



It is a consequence of Theorem 1 that LOL associated with $\lambda^*_n(1)$ and $\lambda^*_n(2)$ is
adaptive over all the sets $B_q(M)$ and $B_0(S, M)$, with respect to the parameters
q and S. These default values also behave reasonably well in practice. However,
they require a fine tuning of the constants $T_1$ and $T_2$, which is proposed in a
slightly more subtle way in the simulation part (see Section 6).

5.4. Coherence condition 

As can be seen in Theorem 1, LOL is minimax under a condition on the coherence
of the type $\tau_n \le c\sqrt{\log p/n}$. This condition is verified with overwhelming
probability, for instance, when the entries of the matrix $\Phi$ are independent and
identically distributed random variables with a sub-gaussian common distribution.
In Section 6, we precisely investigate the behavior of LOL when this hypothesis is
disturbed. This bound is generally a stronger condition than other ones given in
the literature, such as the RIP condition or weaker ones. However,
as explained in the sequel, these other conditions are often impossible to verify
on the data. We consider it a benefit that the procedure gives, with $\tau_n$, an
indication of a potential misbehavior. Besides, Theorem 2 details the behavior
of the algorithm when this condition is not verified.
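The order of magnitude of the coherence for i.i.d. entries can be checked with a small simulation. The sketch below compares the coherence of Gaussian designs with $\sqrt{\log p/n}$ for a few sizes; the helper names, the chosen sizes and the Gaussian law are assumptions made for illustration, and the code says nothing about the value of the constant $c$ in the theorem.

```python
import numpy as np

def coherence(Phi):
    """Largest absolute off-diagonal entry of (1/n) Phi^t Phi."""
    n = Phi.shape[0]
    M = Phi.T @ Phi / n
    np.fill_diagonal(M, 0.0)
    return np.abs(M).max()

def coherence_ratio(n, p, rng):
    """Coherence of a normalized Gaussian design divided by sqrt(log(p) / n)."""
    Phi = rng.normal(size=(n, p))
    Phi /= np.sqrt((Phi ** 2).mean(axis=0))
    return coherence(Phi) / np.sqrt(np.log(p) / n)

rng = np.random.default_rng(0)
sizes = [(100, 200), (200, 400), (400, 800)]
ratios = {size: coherence_ratio(*size, rng) for size in sizes}
```

If the ratio stays of the same order as n and p grow, the design behaves as the coherence condition requires; a ratio that drifts upwards is the kind of warning sign that can be read off the data before running the procedure.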

5.5. Comparison with some existing algorithms 

As mentioned in the previous section, LOL finds its inspiration in the learning
framework, especially in Barron et al. (2008), Binev et al. (2005),
Binev et al. (2007a), Binev et al. (2007b). In all these papers, consistency
results are obtained under fewer assumptions but with no exponential bounds and a
higher cost in implementation. Again in the learning context, Temlyakov (2008)
provides an optimal critical value $\eta_n$ as well as exponential bounds with fewer
assumptions, since there is no coherence restriction. However, the procedure is
very difficult to implement for large values of p and n (NP-hard).

In Fan and Lv (2008), it is assumed that there exists $\kappa > 0$ such that

$$ \min_{\ell \in \mathcal{I}^*} |\alpha_\ell| \ge O(n^{-\kappa}) $$

where $\mathcal{I}^* = \{\ell,\ \alpha_\ell \ne 0\}$. The model under consideration is ultra high
dimensional: $p \le \exp(c n^{\xi})$ for $c, \xi > 0$ with the restriction $\xi < 1 - 2\kappa < 1$. The
procedure SIS-D (SIS followed by Dantzig) is shown to be asymptotically
consistent in the sense that, with large probability, we have

$$ \sum_{\ell=1}^{p} \bigl(\hat\alpha^{SIS\text{-}D}_\ell - \alpha_\ell\bigr)^2 \le C\eta_n $$

where $C$ is a constant depending on the restricted orthogonality constant, but
the order of the convergence is not given. A practical drawback is that the
tuning sequence $\gamma_n$ is not auto-driven since it has to verify $\gamma_n = O(n^{-\theta})$ for
$\theta < 1 - 2\kappa - \tau$, for some $\tau$ linked to the largest eigenvalue of the covariance
matrix of the regressors. Notice that another tuning parameter $\lambda_n$ also has to be
chosen in the Dantzig step.

In Bunea (2008) and Bunea et al. (2007b), the size p grows polynomially
with the sample size n. It is assumed that

$$ \sup_{\ell \in \mathcal{I}^*,\ m \notin \mathcal{I}^*} \Bigl|\frac{1}{n}\sum_{i=1}^{n} \Phi_{i\ell}\Phi_{im}\Bigr| \le O(S^{-1}), $$

which appears to be a weaker condition on the coherence than ours. However,
this condition is impossible to verify on the data since $\mathcal{I}^*$ and $S$ are unknown.
An exponential bound is established for $P\bigl(\sum_{\ell=1}^{p} |\hat\alpha_\ell - \alpha_\ell| \ge \eta\bigr)$ when
$\eta \ge \sqrt{S\log p/n}$, corresponding to our critical value $\eta_n$. This result is comparable
to ours but focuses on the error due to the estimation of the parameter $\alpha$ instead
of the prediction error. In Candes and Plan (2009), the condition on the coherence
$\tau_n \le O((\log p)^{-1})$ is generally lighter, except for very large p, but no exponential
bounds are provided: it is proved that $P(d(\hat\alpha, \alpha) > \eta)$ tends to zero as
$O(p^{-2\log 2})$ for $\eta \ge \sqrt{S\log p/n}$.

6. Practical results 

In this section, an extensive computational study is conducted using LOL. The
performances of LOL are studied over various ranges of levels of indeterminacy
$\delta = 1 - n/p$ and of sparsity rates $\rho = S/n$ (see Maleki and Donoho (2009)). The
influence of the design matrix is investigated: more precisely, as we consider
random matrices, we study the role of the distribution of the design matrix $\Phi$
as well as the nature of the dependency between the inputs. This study is
performed on simulations, and an application with real data is presented. Our
procedure is finally compared to some other well known two-step procedures.

6.1. Experimental design

The design matrix $\Phi$ considered in this study is generally of random type (except
in the example of real data) and is mostly built on $n \times p$ independent and
identically distributed inputs. Different distributions such as Gaussian, Uniform,
Bernoulli, or Student laws are considered. We also investigate the influence of
the dependency on the procedure. It is important to stress that all the above
mentioned parameters ($p$, $n$, the dependency and the type of law) yield different
values of the coherence $\tau_n$ and consequently different behaviors of the procedure.
Each column vector of $\Phi$ is normalized to have unit norm. Given $\Phi$, the target
observations are $Y = \Phi\alpha + \varepsilon$ for $\varepsilon$ i.i.d. variables with a normal distribution
$N(0, \sigma_\varepsilon^2)$, where $\sigma_\varepsilon^2$ is chosen such that the signal over noise ratio (SNR) is in
most studies SNR = 5. When specified, SNR varies from SNR = 10 to SNR = 2. The vector of
parameters $\alpha$ is simulated as follows: all coordinates are zero except $S$ non zero
coordinates with $\alpha_i = (-1)^b |z|$, $i = 1,\dots,S$, where $b$ is drawn from a Bernoulli
distribution with parameter 0.5 and $z$ from a $N(2, 1)$ (see Fan and Lv (2008)).
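The simulation design described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `simulate_design` is ours, "unit norm" is read as the empirical norm $n^{-1}\sum_i \Phi_{i\ell}^2 = 1$, the SNR is read as the ratio between the root mean square of the signal and the noise standard deviation, and the $S$ non zero coordinates are placed at random positions.

```python
import numpy as np

def simulate_design(n, p, S, snr=5.0, seed=0):
    """Simulate (Phi, alpha, Y) in the spirit of the experimental design."""
    rng = np.random.default_rng(seed)
    # i.i.d. Gaussian entries; other laws could be swapped in here.
    Phi = rng.standard_normal((n, p))
    # Assumption: columns normalised in empirical norm, (1/n) sum_i Phi[i, l]^2 = 1.
    Phi /= np.sqrt((Phi ** 2).mean(axis=0))
    # S non zero coordinates alpha_i = (-1)^b |z|, b ~ Bernoulli(0.5), z ~ N(2, 1).
    alpha = np.zeros(p)
    support = rng.choice(p, size=S, replace=False)  # assumption: random positions
    b = rng.integers(0, 2, size=S)
    z = rng.normal(2.0, 1.0, size=S)
    alpha[support] = (-1.0) ** b * np.abs(z)
    signal = Phi @ alpha
    # Assumption: SNR = rms(signal) / sigma_eps.
    sigma_eps = np.sqrt((signal ** 2).mean()) / snr
    Y = signal + sigma_eps * rng.standard_normal(n)
    return Phi, alpha, Y

Phi, alpha, Y = simulate_design(n=250, p=1000, S=10)
```

Replacing the Gaussian draw by another law (uniform, Bernoulli, Student) only changes the line that fills `Phi`.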

To evaluate the quality of LOL, the relative $\ell_2$ error of prediction $E_Y =
\|Y - \hat Y\|_2 / \|Y\|_2$ is computed on the target $Y$. The sparsity $S$ is estimated by the
cardinality of the set $\{\ell = 1,\dots,p,\ \hat\alpha^*_\ell \neq 0\}$ where $\hat\alpha^*$ is provided by LOL. All these
quantities are computed by averaging each estimation result over $K$ replications
of the experiment ($K = 200$).
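The two quality measures and their averaging over replications can be written in a few lines. The helper names below are ours, and `run_once` is a hypothetical callable standing for one replication of the experiment.

```python
import numpy as np

def relative_prediction_error(Y, Y_hat):
    """Relative l2 prediction error ||Y - Y_hat||_2 / ||Y||_2."""
    return np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)

def estimated_sparsity(alpha_star):
    """Number of non zero coordinates of the estimated coefficient vector."""
    return int(np.count_nonzero(alpha_star))

def average_over_replications(run_once, K=200):
    """Average (E_Y, S_hat) over K replications.

    run_once(k) is a hypothetical callable returning the pair (E_Y, S_hat)
    obtained on the k-th simulated data set.
    """
    results = np.array([run_once(k) for k in range(K)], dtype=float)
    return results.mean(axis=0)
```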

6.2. Algorithm 

The parameters $\lambda_n(1)$ and $\lambda_n(2)$ are critical values quite hard to tune practically
because they depend on constants which are not optimized and may be unavailable
in practice (such as the constant $M$, see the theoretical results). Let us
explain how we proceed in this study to adaptively determine the thresholds.

Since the first threshold $\lambda_n(1)$ is used to select the leaders, our aim is to
split the set of "correlations" $\{K_\ell,\ \ell = 1,\dots,p\}$ into two clusters in such a way
that the leaders form one of the two clusters. The sparsity assumption
suggests that the law of the correlations (in absolute value) should be a mixture
of two distributions: one for the leaders (high correlations, positive mean) and
one for the others (very small correlations, zero mean). The frontier between the
clusters is then chosen by minimizing the variance between classes after adjusting
the absolute value of the correlations into the two classes described above (see
also Kerkyacharian et al.).

The same procedure is used to threshold adaptively the estimated coefficients
$\hat\alpha_\ell$ obtained by linear regression on the leaders. Again the distribution of the
$\hat\alpha_\ell$ provides two clusters: one cluster associated to the largest coefficients (in
absolute value) corresponding to the non zero coefficients, and one cluster composed
of coefficients close to zero, which should not be involved in the model.
The frontier between the two clusters, which defines $\lambda_n(2)$, is again computed
by minimizing the deviance between the two classes of regression coefficients.
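A possible reading of this adaptive thresholding is a one-dimensional two-class split of the absolute values. The sketch below is an assumption on our part: it minimizes the within-class sum of squares (equivalently, it maximizes the separation between the two class means) and returns the midpoint between the two clusters as the frontier. The function name `two_class_threshold` is hypothetical.

```python
import numpy as np

def two_class_threshold(values):
    """Frontier between a 'small' and a 'large' cluster of |values|.

    Assumption: the split minimises the within-class sum of squares;
    the frontier is the midpoint between the two adjacent sorted values.
    """
    v = np.sort(np.abs(np.asarray(values, dtype=float)))
    m = len(v)
    k = np.arange(1, m)                 # size of the 'small' class
    low_sum = np.cumsum(v)[:-1]         # sum of the k smallest values
    high_sum = v.sum() - low_sum        # sum of the m - k largest values
    within = (v ** 2).sum() - low_sum ** 2 / k - high_sum ** 2 / (m - k)
    j = int(np.argmin(within))          # best split: v[:j+1] | v[j+1:]
    return 0.5 * (v[j] + v[j + 1])

threshold = two_class_threshold([0.01, -0.02, 0.03, 0.0, 2.0, -2.1, 1.9])
```

Cumulative sums keep the search linear in the number of values once they are sorted, which matters when $p$ is large.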

Finally, an additional improvement for LOL is provided. It is generally more
efficient to perform a second regression using the final set of selected predictors
involved in the model: the estimators of the (non zero) coefficients are then
slightly more accurate. This updating procedure is denoted LOL+ in the sequel.
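Putting the two thresholding steps and the optional refit together gives the following sketch. It is not the authors' implementation: the two-class split is our reading of the adaptive thresholds, `max_leaders` is a hypothetical cap standing in for the theoretical bound $N$ on the number of leaders, and the columns of `Phi` are assumed to be normalized in empirical norm.

```python
import numpy as np

def split_threshold(values):
    """Two-class frontier on |values| (within-class sum of squares; an assumption)."""
    v = np.sort(np.abs(values))
    m = len(v)
    if m < 2:
        return 0.0
    k = np.arange(1, m)
    low_sum = np.cumsum(v)[:-1]
    high_sum = v.sum() - low_sum
    within = (v ** 2).sum() - low_sum ** 2 / k - high_sum ** 2 / (m - k)
    j = int(np.argmin(within))
    return 0.5 * (v[j] + v[j + 1])

def lol(Phi, Y, refit=False, max_leaders=None):
    """Sketch of LOL (refit=False) and LOL+ (refit=True)."""
    n, p = Phi.shape
    # Step 1: "correlations" of each regressor with the target, then leaders.
    corr = Phi.T @ Y / n
    leaders = np.flatnonzero(np.abs(corr) >= split_threshold(corr))
    if max_leaders is not None and len(leaders) > max_leaders:
        order = np.argsort(-np.abs(corr[leaders]))[:max_leaders]
        leaders = leaders[order]
    # Step 2: least squares on the leaders, then second thresholding.
    coef, *_ = np.linalg.lstsq(Phi[:, leaders], Y, rcond=None)
    keep = np.abs(coef) >= split_threshold(coef)
    alpha_star = np.zeros(p)
    alpha_star[leaders[keep]] = coef[keep]
    # LOL+: second regression restricted to the selected predictors.
    if refit:
        coef2, *_ = np.linalg.lstsq(Phi[:, leaders[keep]], Y, rcond=None)
        alpha_star[leaders[keep]] = coef2
    return alpha_star
```

By construction the refit keeps the same support as LOL and can only decrease the residual norm on the data used for fitting.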

6.3. Results with i.i.d. gaussian design matrices 

The design matrix $\Phi$ is first defined with i.i.d. Gaussian variables. Figure 1
(left) shows the evolution of the empirical coherence $\tau_n$ as a function of $\sqrt{n}$ for
$p = 100$, $1000$, or $10000$. Each coherence shown in the graph is the average of
$K = 500$ coherence values computed for different $\Phi$ matrices simulated at random
over the $K$ replications. As the number of observations increases to $n = 5000$
($\sqrt{5000} \approx 70.7$), the coherence tends to be quite small ($\tau_n = 0.1$) independently
of the number of variables $p$. For a small number of observations, the coherence
takes pretty high values, and higher still as the number of predictors increases. For
example, for $n = 250$ ($\sqrt{250} \approx 15.8$),

$p = 100 \mapsto \tau_n = 0.25$, $\quad p = 1000 \mapsto \tau_n = 0.30$, $\quad p = 10000 \mapsto \tau_n = 0.35$.

A difference of 15% is observed between the coherences computed for $p = 1000$
and $p = 100$, or $p = 1000$ and $p = 10000$. Figure 1 (right) shows the evolution of
the coherence as a function of $\sqrt{\log(p)/n}$, which allows us to compute the constant
$c$ introduced in Theorem 1.
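The empirical coherence used in this study can be computed directly from the Gram matrix of the normalized columns. The sketch below assumes the empirical-norm normalization and uses function names of our own; the small default `K` is only there to keep the example fast.

```python
import numpy as np

def coherence(Phi):
    """Largest absolute empirical correlation between two distinct columns."""
    n = Phi.shape[0]
    X = Phi / np.sqrt((Phi ** 2).mean(axis=0))  # (1/n) sum_i X[i, l]^2 = 1
    gram = X.T @ X / n
    np.fill_diagonal(gram, 0.0)                 # ignore the diagonal terms
    return np.abs(gram).max()

def average_coherence(n, p, K=20, seed=0):
    """Average coherence of K random Gaussian n x p design matrices."""
    rng = np.random.default_rng(seed)
    return float(np.mean([coherence(rng.standard_normal((n, p))) for _ in range(K)]))
```

Calling `average_coherence(n, p)` over a grid of `n` and `p` would produce curves of the kind shown in Figure 1.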

Since we are interested in quantifying the performances of LOL in an overwhelming
majority of cases ($n$, $p$ varying), the impact of the level of indeterminacy
and of the sparsity rate are studied: $\delta$ varies from 0 to 0.9 in steps of 0.05
and $\rho$ varies from 0.01 to 0.16 in 20 steps. We fix $p = 1000$ and $n = 250$
for this specific study.

Influence of the indeterminacy level: Figure 2 studies the performances
of LOL when the indeterminacy level is varying ($p = 1000$ fixed, $n$ varying), for
different sparsity values ($S = 10, 12, 15, 20$). The error of prediction $E_Y$ increases
continuously with the indeterminacy $\delta$, as the number of observations decreases
compared to the number of variables. For a given value of $\delta$, $E_Y$ decreases as
the sparsity does. For $\delta < 0.75$, the prediction error is small, below 5%. When
the number of available observations is at least half of the number
of potential predictors ($\delta < 0.5$), the prediction error is negligible: the quality
of LOL is in this case exceptionally good. For a given number of observations
and potential predictors, the prediction is more accurate as the sparsity rate
decreases. For a fixed number of observations, regarding the joint values of both
indeterminacy and sparsity parameters, the error tends to zero as $\delta$ and/or
$\rho$ decrease.
Influence of the sparsity rate: Figure 3 illustrates the performances of
LOL for prediction when the sparsity rate is varying for four levels of indeterminacy
($\delta = 0.4, 0.7, 0.75, 0.875$). For small values of the sparsity rate ($\rho < 5\%$),
the prediction error is very good (less than 5%). For an extreme level of sparsity
($\rho < 2\%$), the performances are excellent. As observed before, for a given
sparsity rate value, the performances are improved as the indeterminacy level is
decreasing.

Estimation of the sparsity $S$: Figure 4 shows the estimation of the sparsity
provided by LOL as a function of the effective sparsity $S$. For small $S$
($\rho < 5\%$), LOL is excellent because it estimates exactly (with no error) the
sparsity $S$ for all the studied indeterminacy levels. As the sparsity $S$ increases,
LOL underestimates the parameter $S$. For a given sparsity value, the underestimation
becomes weaker as the indeterminacy level $\delta$ decreases. Comparing
Figure 3 and Figure 4, we observe that the estimation of the sparsity is obviously linked
to the prediction error, which is not a surprise.

Estimation of the coefficients: Figure 5 presents the improvements provided
by LOL+ compared to LOL as a function of the sparsity rate for the prediction
error. For all indeterminacy and sparsity values, the prediction error decreases
using the LOL+ procedure instead of LOL. The improvements are stronger as both
the sparsity rate and the indeterminacy level increase. The improvements for the
prediction error are observed as $\rho$ increases for all studied indeterminacy levels
$\delta$. Obviously, the estimated sparsity is the same for both procedures LOL and
LOL+.

Ultra high dimension: Table 3 shows the prediction error for ultra high
dimension, with $p = 5000$, $p = 10000$ and $p = 20000$, and for two different values of
$n$: $n = 400$ and $n = 800$. For small sparsity levels ($S = 5, 10, 20$), the performances
are similar even in a very high dimension such as $p = 20000$. As in the previous studies
in smaller dimension, for higher sparsity levels ($S = 40, 60$), the performances
decrease as the sparsity level or the indeterminacy increases.

6.4. Influence of dependence for gaussian design matrices 

In the simulations, the predictors do not all have the same influence because some
predictors are directly involved in the model and some others are not. Different types
of dependency between the predictors can also be distinguished: dependency
between two predictors both involved (or both not involved) in the real underlying model,
and dependency between two predictors, one involved in the model and the other
not. These dependencies do not have the same impact on the results. In order
to simulate all possible dependencies, we first extract a submatrix $\Phi_{n,2S}$ of $\Phi$
defined by concatenating the $S$ columns of the predictors included in
the model (and associated with non zero coefficients) and $S$ columns chosen at random
among the $(p - S + 1)$ predictors not included in the model. $W_1$
is the associated correlation matrix of $\Phi_{n,2S}$. A new correlation matrix $W_2$
is then built by choosing randomly 5% or 20% of the correlations in $W_1$ and
replacing their original value with random values of the form $(-1)^b u$, where
$b$ is drawn from a Bernoulli distribution with parameter 0.5 and $u$ from a
uniform distribution on $[0.90, 0.95]$, in such a way that $W_2$ presents some high
correlations between the $2S$ selected predictors. Since the correlation matrix of
the $2S$ columns of $Z := \Phi_{n,2S} W_1^{-1/2} W_2^{1/2}$ is then $W_2$, we replace the previously
removed columns of $\Phi$ by the columns of $Z$.
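The transformation $Z = \Phi_{n,2S} W_1^{-1/2} W_2^{1/2}$ can be checked numerically. The sketch below makes two assumptions: the "correlation matrix" is taken as the uncentered Gram matrix $n^{-1}X^tX$ of columns normalized in empirical norm, and both $W_1$ and the target $W_2$ are positive definite (a matrix obtained by overwriting entries at random may need to be checked for this). The function names are ours.

```python
import numpy as np

def sym_power(W, power):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(W)
    return (vecs * vals ** power) @ vecs.T

def recolor(X, W2):
    """Return Z = X W1^{-1/2} W2^{1/2}, whose Gram matrix Z^t Z / n equals W2."""
    n = X.shape[0]
    W1 = X.T @ X / n
    return X @ sym_power(W1, -0.5) @ sym_power(W2, 0.5)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
X /= np.sqrt((X ** 2).mean(axis=0))
# Illustrative target: all pairwise correlations equal to 0.5.
W2 = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
Z = recolor(X, W2)
```

The first factor whitens the extracted columns and the second one imposes the target correlations.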

Figure 6 compares the prediction error for both dependent and independent
cases. As expected, some dependency between the predictors damages the performances
of LOL. When the sparsity increases, the dependency seems
to have a lower impact on the prediction error.

6.5. Impact of the family distribution of the design matrix 

In this section, we investigate the impact of the distribution in a design matrix
with i.i.d. entries. Eight different distributions are studied: Gaussian
($N(0,1)$), Uniform ($U[-1,1]$), Bernoulli ($B\{-1,+1\}$) and Student ($T(m)$ with
$m \in \{5,4,3,2,1\}$). Figure 7 shows the empirical density of the coherence $\tau_n$
computed for each law ($n = 250$). Similar distributions are observed for the Gaussian,
Uniform and Bernoulli laws, with a mode of the coherence equal to $\tau_n = 0.30$.
For the Student family, the mode of the empirical distribution shifts
from left to right: 0.36 for $T(5)$, 0.47 for $T(4)$, 0.68 for $T(3)$,
0.92 for $T(2)$ and 0.99 for $T(1)$. The prediction errors computed using LOL are
presented in Table 1. For all distributions, the prediction errors increase with
the sparsity, both on average and in variability. As expected, regarding the coherence
value, similar prediction errors are obtained for the Gaussian, Uniform and Bernoulli
laws. For the Student distributions $T(m)$ with parameter $m > 2$, the prediction
results are also similar to those of the Gaussian distribution. The Student distribution with
$m = 1$ shows much higher prediction errors, both on average and in variability. Figure
8 studies the estimation of the sparsity using LOL as a function of the sparsity
rate $\rho$. All the curves, except the one for the Student law $T(1)$, are superimposed
and show a behavior similar to the one observed for Gaussian predictors (see Figure
4 for $\delta = 0.25$). LOL provides similar results for Gaussian, Uniform, Bernoulli,
or Student laws $T(m)$ with $m$ large enough. It is remarkable that the
procedure works well even when the empirical coherence $\tau_n$ reaches large values.
However, LOL does not work well for heavy tailed variables such as $T(1)$. These
results can be explained by analyzing Figure 9, which shows the coherence of the
matrix restricted to the $N$ selected leaders. This restricted coherence is much
lower than the coherence computed on all the predictors. For the Student $T(1)$
law, $\tau_n = 0.99$ (see Figure 7) while the coherence restricted to the leaders is
0.3 (see Figure 9, for instance for $S = 10$). LOL also provides good results even
when the global coherence approaches 1. It seems then that the practical results
are much more optimistic than the theoretical ones, although they show
deteriorations under high coherence. A conclusion would be that it could be
interesting to find new measures of collinearity to better reflect the performances
of the method. This is true in general for all the methods concerned with high
dimension.

6.6. Comparison with other two-step procedures 

In this part, the performances of LOL are compared with the performances
of other two-step procedures which have been practically studied. The first
one, referred to as SIS-Lasso, comes from Fan and Lv (2008): the selection step
called SIS is followed by the Lasso procedure. The second one, called Lasso-Reg,
is proposed in Candes and Plan (2009). First, the Lasso algorithm performs the
selection of the leaders and then, the coefficients are estimated with a regression.
For simplicity of the presentation, we do not include the results provided by
greedy algorithms.

The performances of the three procedures (LOL, SIS-Lasso, Lasso-Reg) are
here studied over a large range of sparsity in order to cover previous results
already presented in Fan and Lv (2008) and Candes and Plan (2009) for different
sparsity levels. The number of initial predictors is $p = 1000$ and the number of
observations is $n = 200$. This experimental design allows us to analyze extremely
small sparsity values ($10 < S < 20$) (as in Fan and Lv (2008)) as well as values
as large as $S = 60$ (as in Candes and Plan (2009)). For the Lasso procedures,
the regularization parameter is chosen by cross validation. Different signal over
noise ratios are studied (SNR = 10, 5, 2).

Table 2 presents the relative prediction error as defined above for i.i.d. Gaussian
matrices, but similar results are obtained with uniform or Bernoulli distributions.
Different cases of signal over noise ratio are studied (SNR = 10, 5, 2). The
performances of the procedures appear to depend on the sparsity and on the
signal over noise ratio. For small sparsity levels ($S = 10$), all the procedures
perform extremely well and the relative prediction error is similar to the inverse
of the signal over noise ratio. For middle sparsity levels ($20 < S < 30$), Lasso-Reg
performs better than the other ones when the signal over noise ratio is
high (SNR = 10 or 5). In this case, Lasso-Reg seems to be more efficient at
selecting the leaders (during the first step) than both SIS-Lasso and LOL. For a
low signal over noise ratio (SNR = 2), LOL performs better than Lasso-Reg.
The performances of SIS-Lasso and LOL are globally similar.

For the largest values of the sparsity level ($S > 50$), it appears that SIS-Lasso and
LOL are better than Lasso-Reg for middle values of the signal over noise ratio.

We conclude that LOL has a special gain over the other procedures when the 
SNR is small or when the sparsity S is high. 

6.7. LOL in Boston 

In order to illustrate the performances of LOL on real data, we revisit the
Boston Housing data (available from the UCI machine learning data base repository:
http://archive.ics.uci.edu/ml/) by fitting predictive models using LOL.
The original Boston Housing data have one continuous target variable $Y$ (the
median value of owner-occupied homes in USD) and $p_0 = 13$ predictive variables
over $n = 506$ observations, which are randomly split into two subsets: one
training set with 75% of the observations and one test set with the remaining 25% of the
observations.

In order to test our procedure, we consider the linear regression method as a
benchmark and denote by $E_{Reg}$ the prediction error computed on the test set while
the estimated model is computed on the training set.

The data are embedded in a high dimensional space of size $p = 2113$ by adding
300 independent random variables for each of seven different laws (Normal, lognormal,
Bernoulli, Uniform, exponential with parameter 2, Student $T(2)$ and $T(1)$), in equal
proportion. This set of laws is chosen to mimic the different underlying laws
of the 13 original variables. LOL is applied on the training set and the error of
prediction $E_{LOL}$ is computed on the test set. This procedure is repeated $K = 100$
times using resampling, and the prediction errors are then averaged to compute
the performances on the training and test sets. Observe that in this example,
the indeterminacy level and the sparsity rate are quite low, equal to $\delta = 0.18$ and
$\rho = 0.034$. The coherence is quite high, equal to $\tau_n = 0.98$.
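The embedding step can be sketched as below. The function name `embed_with_noise` is ours and several details are assumptions: the parameters of the lognormal law, the support $\{-1,+1\}$ for the Bernoulli variables, the interval $[-1,1]$ for the uniform law, and the reading of "parameter 2" as the rate of the exponential law. A zero matrix stands in for the 13 original variables.

```python
import numpy as np

def embed_with_noise(X, per_law=300, seed=0):
    """Append per_law independent noise columns for each of seven laws."""
    rng = np.random.default_rng(seed)
    size = (X.shape[0], per_law)
    noise = [
        rng.standard_normal(size),                      # Normal
        rng.lognormal(mean=0.0, sigma=1.0, size=size),  # lognormal (assumed parameters)
        rng.choice([-1.0, 1.0], size=size),             # Bernoulli on {-1, +1} (assumed)
        rng.uniform(-1.0, 1.0, size=size),              # Uniform on [-1, 1] (assumed)
        rng.exponential(scale=0.5, size=size),          # exponential, rate 2 (assumed)
        rng.standard_t(2, size=size),                   # Student T(2)
        rng.standard_t(1, size=size),                   # Student T(1)
    ]
    return np.hstack([X] + noise)

X0 = np.zeros((506, 13))        # placeholder for the 13 original predictors
X_big = embed_with_noise(X0)    # 13 + 7 * 300 = 2113 columns
```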

The results are the following:

$E_{LOL} = 0.245\ (0.05)$ and $E_{Reg} = 0.266\ (0.04)$,

and LOL appears to work very well in this case because prediction errors similar to
those of a regular linear regression in $p_0 = 13$ dimensions are obtained even from a
high dimensional space with $p = 2113$.

7. Proofs 

First, we state preliminary results and next we prove Theorem 2, and Theorem
1 as a consequence of Theorem 2. The proofs of the preliminaries are postponed
to the appendix.

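For any subset of indices $\mathcal{I} \subset \{1,\dots,p\}$, $V_{\mathcal I}$ denotes the subspace of $\mathbb{R}^n$
spanned by the columns of the extracted matrix $\Phi_{|\mathcal I}$ and $P_{V_{\mathcal I}}$ denotes the
projection onto $V_{\mathcal I}$ (in the Euclidean sense in $\mathbb{R}^n$). Let $\bar\alpha(\mathcal I)$ be the vector of
$\mathbb{R}^{\#(\mathcal I)}$ such that $\Phi_{|\mathcal I}\,\bar\alpha(\mathcal I) := P_{V_{\mathcal I}}[\Phi\alpha]$. Obviously, as soon as $\#(\mathcal I) \le N$, we get

$$\bar\alpha(\mathcal I) = (\Phi_{|\mathcal I}^t \Phi_{|\mathcal I})^{-1} \Phi_{|\mathcal I}^t \Phi\alpha.$$

As well, let $\hat\alpha(\mathcal I)$ be such that $\Phi_{|\mathcal I}\,\hat\alpha(\mathcal I) := P_{V_{\mathcal I}}(Y)$.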
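The two restricted vectors defined above are ordinary least squares coefficients on the extracted columns. The sketch below, with names of our own, computes them through the normal equations and illustrates that the coefficients obtained from $Y$ are those obtained from $\Phi\alpha$ plus those obtained from the error terms. It assumes the extracted columns are linearly independent.

```python
import numpy as np

def restricted_coefficients(Phi, target, idx):
    """Coefficients a such that Phi[:, idx] @ a is the projection of target on V_I."""
    sub = Phi[:, idx]
    # Normal equations; assumes the extracted columns are linearly independent.
    return np.linalg.solve(sub.T @ sub, sub.T @ target)

rng = np.random.default_rng(0)
n, p = 40, 12
Phi = rng.standard_normal((n, p))
alpha = rng.standard_normal(p)
noise = rng.standard_normal(n)   # plays the role of u + eps
idx = [0, 3, 7]

alpha_bar = restricted_coefficients(Phi, Phi @ alpha, idx)          # from P[Phi alpha]
alpha_hat = restricted_coefficients(Phi, Phi @ alpha + noise, idx)  # from P[Y]
noise_part = restricted_coefficients(Phi, noise, idx)               # from P[u + eps]
residual = Phi @ alpha + noise - Phi[:, idx] @ alpha_hat
```

The residual is orthogonal to the extracted columns, which is the defining property of the projection.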

7.1. Preliminaries 

The preliminaries contain three essential results for the subsequent proof. The
first proposition describes the algebraic behavior of the Euclidean norm of $\hat\alpha(\mathcal B) - \alpha$
when the vector is restricted to a (small) set of indices. The second lemma
is a consequence of the RIP property and gives an algebraic equivalent for the
projection norm of vectors over spaces of small dimensions. The second proposition
(third result) describes the concentration property for projection norms of
the vector of errors. This proposition is our major ingredient for proving all the
exponential bounds. Note that it also incorporates the case where the projection
has a possibly random range.

Proposition 1. Let $\mathcal I$ be a subset of the set $\mathcal B$ of the leaders indices. Then

Y_mB) - a e ) 2 < K(a)#m^ (1 + \\?v B [e.]\\i) + + J^W^Mfn 

lex 

where 

3(5 + v) 2 3 

K(a) = TT^r l|a|l ^p) v iT^y 



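Lemma 2. Let $\mathcal I$ be a subset of $\{1,\dots,p\}$ satisfying $\#(\mathcal I) \le N$. Then, for
any $x \in \mathbb{R}^n$, we get

$$(1+\nu)^{-1} \sum_{\ell\in\mathcal I} \frac{1}{n}\Big(\sum_{i=1}^{n} x_i \Phi_{i\ell}\Big)^2 \le \|P_{V_{\mathcal I}} x\|^2_{\ell_2(n)} \le (1-\nu)^{-1} \sum_{\ell\in\mathcal I} \frac{1}{n}\Big(\sum_{i=1}^{n} x_i \Phi_{i\ell}\Big)^2.$$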

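Proposition 2. Let $\mathcal I$ be a non random subset of $\{1,\dots,p\}$ such that $\#(\mathcal I) \le n_{\mathcal I}$,
where $n_{\mathcal I}$ is a deterministic quantity. Then

$$P\Big(\frac{1}{\sigma^2}\|P_{V_{\mathcal I}}[\varepsilon]\|_n^2 \ge \mu^2\Big) \le \exp(-n\mu^2/16) \qquad (15)$$

for any $\mu$ such that $\mu^2 \ge 4 n_{\mathcal I}/n$. If now $\mathcal I$ is a random subset of $\{1,\dots,p\}$ such
that $\#(\mathcal I) \le n_{\mathcal I}$, where $n_{\mathcal I}$ is a deterministic quantity, then (15) is still true but
for any $\mu$ such that $\mu^2 \ge 16\, n_{\mathcal I} \log p/n$.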

Proposition 1 and Proposition 2 are proved in the appendix, as well as Lemma 2.
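A quick Monte Carlo experiment can illustrate the non random case of Proposition 2. The sketch below is only a sanity check under our own assumptions: Gaussian errors with variance $\sigma^2$, the empirical norm $\|x\|_n^2 = n^{-1}\sum_i x_i^2$, a fixed subspace of dimension $d$ spanned by random columns, and the smallest admissible level $\mu^2 = 4d/n$. The function name is hypothetical.

```python
import numpy as np

def projection_tail(n, d, mu2, sigma=1.0, reps=20000, seed=0):
    """Empirical frequency of the event ||P eps||_n^2 / sigma^2 >= mu2."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal basis of V_I
    eps = sigma * rng.standard_normal((reps, n))
    # Squared empirical norm of the projection of each noise vector on V_I.
    proj_sq = ((eps @ basis) ** 2).sum(axis=1) / n
    return float(np.mean(proj_sq / sigma ** 2 >= mu2))

n, d = 100, 5
mu2 = 4 * d / n                       # smallest level allowed in the non random case
freq = projection_tail(n, d, mu2)
bound = float(np.exp(-n * mu2 / 16))  # right-hand side of (15)
```

In this setting the observed frequency should sit well below the exponential bound, which is not meant to be tight.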



7.2. Proof of Theorem 2 

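For the sake of simplicity, and without loss of generality, we assume that the $N$
largest $|\alpha_\ell|$'s have their indices in $\{1,\dots,N\}$. We have

$$d(\hat\alpha^*, \alpha) \le \|u\|_n + \Big\|\sum_{\ell=1}^{p} (\hat\alpha^*_\ell - \alpha_\ell)\Phi_{\cdot\ell}\Big\|_n.$$

Recall that $\mathcal B$ is the set of the indices of the leaders. Then

$$\Big\|\sum_{\ell=1}^{p} (\hat\alpha^*_\ell - \alpha_\ell)\Phi_{\cdot\ell}\Big\|_n \le \Big\|\sum_{\ell=1}^{p} I\{\ell\in\mathcal B\}(\hat\alpha^*_\ell - \alpha_\ell)\Phi_{\cdot\ell}\Big\|_n + \Big\|\sum_{\ell=1}^{p} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\Big\|_n := \mathrm{I}\ (\text{In}) + \mathrm{O}\ (\text{Out}).$$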



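We split I into four terms by observing that:

$$1 = I\{|\hat\alpha_\ell(\mathcal B)| \ge \lambda_n(2)\}\big\{I\{|\alpha_\ell| \ge \lambda_n(2)/2\} + I\{|\alpha_\ell| < \lambda_n(2)/2\}\big\} + I\{|\hat\alpha_\ell(\mathcal B)| < \lambda_n(2)\}\big\{I\{|\alpha_\ell| \ge 2\lambda_n(2)\} + I\{|\alpha_\ell| < 2\lambda_n(2)\}\big\}.$$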

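It follows that

$$\begin{aligned}
\mathrm{I} \le{} & \Big( \Big\|\sum_{\ell=1}^{N} I\{\ell\in\mathcal B\}(\hat\alpha_\ell(\mathcal B) - \alpha_\ell)\Phi_{\cdot\ell}\, I\{|\alpha_\ell| \ge \lambda_n(2)/2\}\, I\{|\hat\alpha_\ell(\mathcal B)| \ge \lambda_n(2)\}\Big\|_n \\
& + \Big\|\sum_{\ell=1}^{N} I\{\ell\in\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| \ge 2\lambda_n(2)\}\, I\{|\hat\alpha_\ell(\mathcal B)| < \lambda_n(2)\}\Big\|_n \Big) \\
& + \Big( \Big\|\sum_{\ell=1}^{p} I\{\ell\in\mathcal B\}(\hat\alpha_\ell(\mathcal B) - \alpha_\ell)\Phi_{\cdot\ell}\, I\{|\alpha_\ell| < \lambda_n(2)/2\}\, I\{|\hat\alpha_\ell(\mathcal B)| \ge \lambda_n(2)\}\Big\|_n \\
& + \Big\|\sum_{\ell=1}^{p} I\{\ell\in\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| < 2\lambda_n(2)\}\, I\{|\hat\alpha_\ell(\mathcal B)| < \lambda_n(2)\}\Big\|_n \Big) \\
:={} & \text{IBB (InBigBig)} + \text{ISB (InSmallBig)} + \text{IBS (InBigSmall)} + \text{ISS (InSmallSmall)}.
\end{aligned}$$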



Note that because of Assumption (10), the coefficients such that $|\alpha_\ell| \ge \lambda_n(2)/2$
necessarily have their indices less than $N$, so some terms in the above sum
have their summation up to $N$, some others up to $p$. This makes an important
difference in the sequel because Lemma 1 can be used in the first case. Recall
the definition of the $\tilde\alpha_\ell$'s given in Algorithm 3.1:

$$\tilde\alpha_\ell = \frac{1}{n}\sum_{i=1}^{n} \Phi_{i\ell}\, Y_i.$$
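We have

$$\begin{aligned}
\mathrm{O} \le{} & \Big\|\sum_{\ell=N+1}^{p} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\Big\|_n + \Big\|\sum_{\ell=1}^{N} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\Big\|_n \\
\le{} & \Big\|\sum_{\ell=N+1}^{p} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\Big\|_n + \Big\|\sum_{\ell=1}^{N} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| < 2\lambda_n(1)\}\Big\|_n \\
& + \Big\|\sum_{\ell=1}^{N} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| \ge 2\lambda_n(1)\}\, I\{|\tilde\alpha_\ell| < \lambda_n(1)\}\Big\|_n \\
& + \Big\|\sum_{\ell=1}^{N} I\{\ell\notin\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| \ge 2\lambda_n(1)\}\, I\{|\tilde\alpha_\ell| \ge \lambda_n(1)\}\Big\|_n \\
:={} & \text{Ob (OutBias)} + \text{OS (OutSmall)} + \text{OBS (OutBigSmall)} + \text{OBB (OutBigBig)}.
\end{aligned}$$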
Using Assumption (13) on the errors, we get

$$\|u\|_n \le \sup_{i=1,\dots,n} |u_i| \le c_0\sqrt{\frac{S}{n}}.$$

We deduce that for any $\eta$ such that

$$\eta \ge 2c_0\sqrt{\frac{S}{n}}, \qquad (16)$$

$$P\big(d(\hat\alpha^*, \alpha) > \eta\big) \le P(\mathrm{I} + \mathrm{O} > \eta/2) \le P(IBB + ISB + IBS + ISS > \eta/4) + P(OS + OBB + OBS + Ob > \eta/4).$$

Our aim is to prove that each probability term is bounded by $\exp(-\gamma n \eta^2)$ for any
$\eta$ such that

$$\eta^2 \ge D\Big(\frac{S\log p}{n} \vee S\tau_n^2\Big),$$

where the constants $\gamma$ and $D$ have to be determined. To do this, basically, we
study each term separately and prove that (up to constants) either it can be
directly bounded, or it reduces to a random term whose probability of excess
can be bounded using Proposition 2.



7.2.1. Study of IBB and ISB 

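Denote by $T$ the (non random) set of indices $\{\ell = 1,\dots,N,\ |\alpha_\ell| \ge \lambda_n(2)/2\}$
which verifies $\#T \le S$ by Assumption (10). Observe that

$$\big\{|\alpha_\ell| \ge 2\lambda_n(2),\ |\hat\alpha_\ell(\mathcal B)| < \lambda_n(2)\big\} \implies |\hat\alpha_\ell(\mathcal B)| \le \frac{|\alpha_\ell|}{2} \le |\alpha_\ell - \hat\alpha_\ell(\mathcal B)|.$$

Using Lemma 1, we deduce that

$$\begin{aligned}
ISB^2 &\le (1+\nu) \sum_{\ell\in T\cap\mathcal B} \big|(\hat\alpha(\mathcal B))_\ell + (\alpha_\ell - (\hat\alpha(\mathcal B))_\ell)\big|^2\, I\{|(\hat\alpha(\mathcal B))_\ell| \le |\alpha_\ell - (\hat\alpha(\mathcal B))_\ell|\}\, I\{|\alpha_\ell| \ge 2\lambda_n(2)\} \\
&\le 4(1+\nu) \sum_{\ell\in T\cap\mathcal B} (\alpha_\ell - (\hat\alpha(\mathcal B))_\ell)^2.
\end{aligned}$$

Using again Lemma 1, it follows that

$$ISB^2 + IBB^2 \le 5(1+\nu) \sum_{\ell\in T\cap\mathcal B} (\alpha_\ell - (\hat\alpha(\mathcal B))_\ell)^2.$$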



We apply Proposition 1 



ISB 2 + IBB 2 < 5(1+v)( K (a)ST 2 (1+ ||P VB [£]|| 2 )+^ ; ^ + T ^ ; ||Pv T [e]|| 2 



We now use Proposition 2: first with the non random set $T$ satisfying $\#(T) \le S$,
secondly with the random set $\mathcal B$ such that $\#(\mathcal B) \le N$. For this second part, we
use the last part of Proposition 2, which yields an additional logarithmic factor.
We obtain



( — l|Pv T e||n >r| 2 (1 -v)/(7680(1 +^)ff 2 



1 

a 2 

< 2cxp {-nr) 2 (1 - v)/(30720(1 + v)o 2 ) 
since S < N = v/x n and as soon as 



P( -zi ll p v 8 e|ln>il 2 (S'^r 1 7(2560(1 +v) K ( a )a 2 



r, 2 > 2560 f (1 +v) K (a)ST 2 V-^--^) V 3072oj^-^ff 2 - 
\ 1-vny (1 + v) n 

V40960v(1 +Y)vK(a)o- 2 STn l0gP . (17) 

n 
7.2.2. Study of Ob

Since the $\Phi_{\cdot\ell}$'s are normalized vectors and because of the definition of the
coherence, we get



ob < ii y_ a «^.«IU 



£>N + 1 



< 



< 



£>N + 1 



£>N + 1 



1/2 



Y_ <4l{\oc e \ < A n (2)/2} + x n I a «l 



£>N + 1 



£>N + 1 



1 2 



As $\lambda_n(2) \le \lambda_n(1)$ and using Assumption (12) and Assumption (11), we obtain

$$Ob \le (c_1 + c_2)\sqrt{\frac{S\log p}{n}},$$

which implies that $Ob \le \eta/16$ as soon as

$$\eta^2 \ge 256\,(c_1 + c_2)^2\,\frac{S\log p}{n}. \qquad (18)$$



7.2.3. Study of ISS and OS 

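As $\lambda_n(1) \ge \lambda_n(2)$, using successively Lemma 1 and Assumption (12), we have

$$\begin{aligned}
ISS &\le Ob + \Big\|\sum_{\ell=1}^{N} I\{\ell\in\mathcal B\}\,\alpha_\ell\Phi_{\cdot\ell}\, I\{|\alpha_\ell| < 2\lambda_n(2)\}\, I\{|\hat\alpha_\ell(\mathcal B)| < \lambda_n(2)\}\Big\|_n \\
&\le Ob + (1+\nu)^{1/2}\Big(\sum_{\ell} \alpha_\ell^2\, I\{|\alpha_\ell| < 2\lambda_n(1)\}\Big)^{1/2} \\
&\le (c_1 + c_2)\sqrt{\frac{S\log p}{n}} + (1+\nu)^{1/2} c_2 \sqrt{\frac{S\log p}{n}}.
\end{aligned}$$

This implies that $ISS \le \eta/16$ as soon as

$$\eta^2 \ge 512\big((c_1 + c_2)^2 + (1+\nu)c_2^2\big)\frac{S\log p}{n}. \qquad (19)$$

In the same way, $OS \le \eta/16$.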
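The constant 512 in condition (19) can be traced with the elementary inequality $(a+b)^2 \le 2(a^2+b^2)$. The lines below are our reading of that step, assuming the bound on ISS has the form $(a+b)\sqrt{S\log p/n}$ with the shorthand $a = c_1 + c_2$ and $b = (1+\nu)^{1/2}c_2$ (both introduced here only for illustration):

```latex
\begin{align*}
(a+b)\sqrt{\frac{S\log p}{n}} \le \frac{\eta}{16}
  &\iff \eta^2 \ge 256\,(a+b)^2\,\frac{S\log p}{n},\\
(a+b)^2 \le 2\,(a^2+b^2)
  &\implies 256\,(a+b)^2 \le 512\,\big((c_1+c_2)^2 + (1+\nu)\,c_2^2\big).
\end{align*}
```

Hence a condition of the form (19) is sufficient, though not necessary, for the bound on ISS to be below $\eta/16$.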
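7.2.4. Study of OBS

Using the model and the definition of $\tilde\alpha_\ell$ given in Algorithm 3.1, we get

$$\tilde\alpha_\ell = \frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{m=1}^{p} \alpha_m\Phi_{im} + u_i + \varepsilon_i\Big]\Phi_{i\ell}.$$

Since $\Phi$ has normalized columns, we can write

$$\tilde\alpha_\ell - \alpha_\ell = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{m=1,\, m\ne\ell}^{p} \alpha_m\Phi_{im}\Big)\Phi_{i\ell} + \frac{1}{n}\sum_{i=1}^{n} u_i\Phi_{i\ell} + \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\Phi_{i\ell},$$

which implies that

$$\begin{aligned}
|\tilde\alpha_\ell - \alpha_\ell| &\le \Big|\frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{m=1,\, m\ne\ell}^{p} \alpha_m\Phi_{im}\Big)\Phi_{i\ell}\Big| + \Big|\frac{1}{n}\sum_{i=1}^{n} u_i\Phi_{i\ell}\Big| + \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\Phi_{i\ell}\Big| \\
&\le \sum_{m=1,\, m\ne\ell}^{p} |\alpha_m|\,\Big|\frac{1}{n}\sum_{i=1}^{n} \Phi_{im}\Phi_{i\ell}\Big| + \Big|\frac{1}{n}\sum_{i=1}^{n} u_i\Phi_{i\ell}\Big| + \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\Phi_{i\ell}\Big| \\
&\le \tau_n \sum_{m=1,\, m\ne\ell}^{p} |\alpha_m| + \Big|\frac{1}{n}\sum_{i=1}^{n} u_i\Phi_{i\ell}\Big| + \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\Phi_{i\ell}\Big|. \qquad (20)
\end{aligned}$$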



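Recall that $\lambda_n(1) \ge \lambda_n(2)$. We get

$$I\{|\tilde\alpha_\ell| < \lambda_n(1)\}\, I\{|\alpha_\ell| \ge 2\lambda_n(1)\} \le I\{|\alpha_\ell - \tilde\alpha_\ell| > \lambda_n(1) > |\tilde\alpha_\ell|\}\, I\{|\alpha_\ell| \ge \lambda_n(2)/2\}.$$

Hence, using Lemma 1, it follows

$$\begin{aligned}
OBS^2 &\le (1+\nu)\sum_{\ell=1}^{N} (\alpha_\ell - \tilde\alpha_\ell + \tilde\alpha_\ell)^2\, I\{|\alpha_\ell - \tilde\alpha_\ell| > \lambda_n(1) > |\tilde\alpha_\ell|\}\, I\{|\alpha_\ell| \ge \lambda_n(2)/2\} \\
&\le 4(1+\nu)\sum_{\ell=1}^{N} (\alpha_\ell - \tilde\alpha_\ell)^2\, I\{|\alpha_\ell| \ge \lambda_n(2)/2\}. \qquad (21)
\end{aligned}$$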



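Denote by $T$ the (non random) set of indices $\{\ell = 1,\dots,N,\ |\alpha_\ell| \ge \lambda_n(2)/2\}$. Using
inequality (20), we obtain

$$OBS^2 \le 8(1+\nu)\Big[\tau_n^2 \sum_{\ell\in T}\Big(\sum_{m=1}^{p} |\alpha_m|\Big)^2 + 2\sum_{\ell\in T}\Big(\frac{1}{n}\sum_{i=1}^{n} u_i\Phi_{i\ell}\Big)^2 + 2\sum_{\ell\in T}\Big(\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\Phi_{i\ell}\Big)^2\Big] := OBS_1 + OBS_2 + OBS_3.$$

By Assumption (10), we have $\#T \le S$, implying

$$OBS_1 \le 8(1+\nu)\,\|\alpha\|^2_{\ell_1(p)}\,\tau_n^2\, S.$$

Using Lemma 2 and Assumption (13) on the errors $u$, we get

$$OBS_2 \le 16(1+\nu)\,\|P_{V_T}[u]\|_n^2 \le 16(1+\nu)\,\|u\|_n^2 \le 16(1+\nu)\,c_0^2\,\frac{S}{n}$$

and

$$OBS_3 \le 16(1+\nu)\,\|P_{V_T}[\varepsilon]\|_n^2.$$

Proposition 2 ensures that

$$P(OBS > \eta/16) \le P\Big(\frac{1}{\sigma^2}\|P_{V_T}[\varepsilon]\|_n^2 > \frac{\eta^2}{8192\,\sigma^2(1+\nu)}\Big) \le \exp\Big(-\frac{n\eta^2}{131072\,\sigma^2(1+\nu)}\Big)$$

as soon as

$$\eta^2 \ge 8192(1+\nu)\Big(\|\alpha\|^2_{\ell_1(p)}\, S\tau_n^2 \vee (2c_0^2 \vee 4\sigma^2)\frac{S}{n}\Big). \qquad (22)$$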

7.2.5. Study of OBB 

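Observe that the (random) set of indices

$$T = \{\ell\notin\mathcal B,\ |\alpha_\ell| \ge 2\lambda_n(1),\ |\tilde\alpha_\ell| \ge \lambda_n(1)\}$$

has no more than $S$ elements (using Assumption (10) with $\lambda_n(1) \ge \lambda_n(2)$) and
is equal to $T_1 \cup T_2$ where $T_1 = T \cap \{\ell,\ |\tilde\alpha_\ell| < |\alpha_\ell|/2\}$ and $T_2 = T \cap \{\ell,\ |\tilde\alpha_\ell| \ge |\alpha_\ell|/2\}$.
On the one hand, we obviously have

$$T_1 \subset \{\ell\notin\mathcal B,\ |\alpha_\ell| \ge 2\lambda_n(1),\ |\alpha_\ell| \le 2|\tilde\alpha_\ell - \alpha_\ell|\}. \qquad (23)$$

On the other hand, since $\ell\notin\mathcal B$ while $|\tilde\alpha_\ell| \ge \lambda_n(1)$, there exist at least $N$ (leader)
indices $\ell'$ in $\{1,\dots,p\}$ such that $|\tilde\alpha_{\ell'}| \ge |\tilde\alpha_\ell|$. Moreover Assumption (10) ensures
that there are no more than $S$ indices $\ell'$ such that $|\alpha_{\ell'}| \ge \lambda_n(1)/2$. Thus, using
the fact that $S < N$, we deduce that there exists at least one index depending
on $\ell$, called $\ell^*(\ell)$, such that

$$|\alpha_{\ell^*(\ell)}| < \lambda_n(1)/2 \quad\text{and}\quad |\tilde\alpha_{\ell^*(\ell)}| \ge |\tilde\alpha_\ell|.$$

Since $\ell \in T_2$, this implies that

$$|\tilde\alpha_{\ell^*(\ell)} - \alpha_{\ell^*(\ell)}| \ge |\alpha_\ell|/4. \qquad (24)$$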
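Inequality (24) follows from a reverse triangle inequality. The short derivation below spells out this step under our reading of the definitions: on $T_2$ we have $|\tilde\alpha_\ell| \ge |\alpha_\ell|/2$, and on $T$ we have $|\alpha_\ell| \ge 2\lambda_n(1)$, so that $\lambda_n(1)/2 \le |\alpha_\ell|/4$.

```latex
\begin{align*}
|\tilde\alpha_{\ell^*(\ell)} - \alpha_{\ell^*(\ell)}|
  &\ge |\tilde\alpha_{\ell^*(\ell)}| - |\alpha_{\ell^*(\ell)}| \\
  &\ge |\tilde\alpha_{\ell}| - \lambda_n(1)/2 \\
  &\ge |\alpha_\ell|/2 - |\alpha_\ell|/4 \;=\; |\alpha_\ell|/4 .
\end{align*}
```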
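Using (23) and (24), it follows that

$$\begin{aligned}
OBB^2 &\le (1+\nu)\Big[\sum_{\ell\in T_1} |\alpha_\ell|^2 + \sum_{\ell\in T_2} |\alpha_\ell|^2\Big] \\
&\le (1+\nu)\Big[\sum_{\ell=1}^{N} 4|\alpha_\ell - \tilde\alpha_\ell|^2\, I\{|\alpha_\ell| \ge 2\lambda_n(1)\} + \sum_{\ell\in T_2} 16\,|\tilde\alpha_{\ell^*(\ell)} - \alpha_{\ell^*(\ell)}|^2\Big] \\
&:= OBB_1 + OBB_2.
\end{aligned}$$

Since $\lambda_n(1) \ge \lambda_n(2)$, $OBB_1$ can be bounded as OBS. The computations are
exactly the same for the term $OBB_2$ except that the set $T_2$ is now random and
the conditions on $\eta$ become

$$\eta^2 \ge 32768(1+\nu)\Big(\|\alpha\|^2_{\ell_1(p)}\, S\tau_n^2 \vee 2c_0^2\,\frac{S}{n} \vee 16\sigma^2\,\frac{S\log p}{n}\Big). \qquad (25)$$

For such an $\eta$, we obtain

$$P(OBB > \eta/16) \le \exp\Big(-\frac{n\eta^2}{524288\,\sigma^2(1+\nu)}\Big).$$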



7.2.6. Study of IBS 

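Note here that the major difficulty lies in the fact that the summation is not
on the set of indices $\ell \le N$ as for the other terms. Let $T$, $T'$ be the subsets of
$\{1,\dots,p\}$ defined as follows

$$T = \{\ell\in\mathcal B,\ |\tilde\alpha_\ell| \ge \lambda_n(1),\ |\hat\alpha_\ell(\mathcal B)| \ge \lambda_n(2),\ |\alpha_\ell| < \lambda_n(2)/2\}$$

and

$$T' = \{\ell\in\mathcal B,\ |\tilde\alpha_\ell - \alpha_\ell| \ge \lambda_n(1)/2,\ |\hat\alpha_\ell(\mathcal B) - \alpha_\ell| \ge \lambda_n(2)/2,\ |\alpha_\ell| < \lambda_n(2)/2\}$$

and observe that $T \subset T'$ (using again that $\lambda_n(2) \le \lambda_n(1)$). Denote

$$K(T) = \#\big(T \cap \{\ell,\ |\tilde\alpha_\ell - \alpha_\ell| \ge \lambda_n(1)/2\}\big)$$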
and put k = [s^t^tx^J A N k = Lf^J AN. We get 

$$P(IBS > \eta/16) \le P(IBS > \eta/16 \text{ and } K(T) \le k_0) + P(IBS > \eta/16 \text{ and } K(T) > k_0) := p_1 + p_2.$$

Notice that $p_2 = 0$ when $k_0 = N$ since $T \subset \mathcal B$. To bound $p_1$, we proceed rather
roughly. By Proposition 1, we get, for any $k \le k_0$

P(IBS >r|/16 and K(T) = k) < P j (1 +y) (ct { - &U?) £ ) 2 > r, 2 /256 and K(T) = k j 

V (ems / 

- P (J2 l|Pv T e||n >T1 2 H -v)/(15360(l +v)cr 2 ) and K(T) = 

+ P (^2 H P v B £!ln >ri 2 (kT 2 )-V(5120(l +v) K (a)ff 2 ) and K(T) = k)^ 

< exp (-nri 2 (1 - v)/(245760(1 + v)cr 2 )) 
+ cxp(-nTi 2 (kT 2 )- 1 /(8192(l +v) K (a)o- 2 )) 
<2exp(-nri 2 (l - v)/(245760(1 + y)o 2 )) 
because $k\tau_n^2 \le k_0\tau_n^2 \le N\tau_n^2 \le 1$. The previous bound is valid for any $k \le k_0$ as
soon as

/ r 2 S 

n 2 >5120 k(o)St 2 V— 2 

\ I — v n 

V 61440^o- 2 ^V 81920(1 +v)cx 2 K(a^ 
1 — "v n 



which is equivalent to 



if 



K(a)Sx 2 + y^^- ) 



A 2 (1) > cte6144l^o- 2 - Vcte81920v(1 + v)g 2 K (a) Tn lQgP . (27) 
1 — v n n 

Finally, we get 

pi < Y_ Y P(IBS > n/1 6 and K(T) = k) 

k<k r,K(T)=k 

< Y_ (p k 2exp(-nT 1 2 (1 -v)/(245760(1 + v)cr 2 ))) 

k<k 

<2exp( - (nu 2 (1 - v)/(245760(1 + v)cr 2 )) U m J f '°g An - 2 , 

[ v y \ nr^O — v)/(245760(1 + y}a 2 J 

<2cxp(-nri 2 (1 -v)/(491520(1 + v)cr 2 ))) 
thanks to the choice of ko and as soon as 

A 2 (1) > 2*245760(cte)- 1 ji^o- 2 ^^. (28) 

To bound p2 (only in the case where ko < N), we proceed as above, considering 
all the (non random) possible sets for T. The inclusion TcT' ensures that 

k>k T,K(T)=k leT 

We already have seen that 

2J^-^ 2 < 2t 2 ^[ ^ |a m |] 2 +4^[l^u i O i£ ] 2 +4^^Z £ ^ (1) « ]2 

leT leT m=l leT i=1 leT i=1 

with 

YS f_ lam|] 2 <#(r)||a|| 2 (p) and ^ Ui(D if ] 2 < (1 + v)c 2 -. 

leT m=i leT n i=i n 






It follows 



l+k <k<N {GT i=1 



as soon as 

S 
n 



2^||a||2, (p) < A n (1) 2 /16 and 4(1 + v)c§- < k A n (1 ) 2 /16. (29) 



Recall that ko = Lf^jyJ ^ N. Then the second condition is satisfied as soon as 

U 2 >64(cte)- 1 (1 +v)c 2 A (30) 
Using again Lemma 2, it follows 

P2< X P k P(Jj l|Pv r M|| 2 >kA n (1) 2 /(32(1 +v)a 2 )) 

1+k <k<N 

< Y_ p k exp(-u(1 +v)kA rt (1) 2 /(512(1 +v)cr 2 )) 

1+k <k<N 

klogp 



< Y_ exp(-[nkA n (1) 2 /(1024(1 +v)cr 2 ; 

1+k <k<N ^ 

< Y_ exp(-nkA n ( I) 2 / (2048(1 + v)cr 2 )) 

1+k <k<N 



1 - 



nkA n (1) 2 /(1024(1 + v)cx 2 ) 



for 



A 2 (1) > 2048a 2 (1 +v) (31) 



It follows that 

P2 <cxp(-nk A n (1) 2 /(2048(1 +v)cr 2 )) 
and replacing ko, we conclude that 

P2 < cxp(-nri 2 cte/(2048ff 2 (1 + v) 2 )) . 



7.2.7. End of the proof 

We now use Assumption (9), ensuring that $M$ is the radius of the $\ell_1$-ball of the
$\alpha$'s, to bound $\kappa(\alpha)$ by $(12M^2 + 5c_0^2)/(1-\nu)^2$. Collecting the conditions (27),
(28), (29) and (31) on the level $\lambda_n(1)$, we obtain the constraint

491520 \ logp w „ w2 , 



A 2 (1) > ff 2 (1 +y) 2048 V V32M 2 t 

cte(1 — v) I n 






Moreover $\eta$ has to satisfy successively the conditions (17), (16), (18), (19), (22),
(26) and (30), leading to the final condition

$$\eta^2 \ge D\Big(\frac{S\log p}{n} \vee S\tau_n^2\Big)$$

for

D = ri 512 ° f12M 2 V5cg)V163840 fi ^ „ V512(d +c 2 ) 2 V65536(M 2 + 16ct 2 ). 
(1-v) 2 (1-v) 2 

For such an $\eta$, we have

$$\begin{aligned}
P\big(d(\hat\alpha^*, \alpha) > \eta\big) &\le P(IBB + ISB + IBS + ISS > \eta/4) + P(OS + OBB + OBS + Ob > \eta/4) \\
&\le P(IBB + ISB > \eta/8) + P(IBS > \eta/8 - ISS) \\
&\quad + P(OBB > \eta/8 - OS) + P(OBS > \eta/8 - Ob)
\end{aligned}$$

which is bounded by $8\exp(-n\eta^2/\gamma)$ for

y = C(1 +v)cr 2 (1 +(1 +v)ct 2 ) 

where $C$ is a universal numerical constant.



7.3. Proof of Theorem 1 

To prove that Theorem 1 is a consequence of Theorem 2, we need to prove 

B q (M) c V ^M q (T 3 /2)~ q ^^~^ q/ > and B (S,M) c V(S,M) 

for $(c_0, c_1, c_2)$ to be specified. First, assume that $\alpha \in B_q(M)$ for $q \in (0,1]$.
Since $q \le 1$, we have $\|\alpha\|_{\ell_1(p)} \le \|\alpha\|_{\ell_q(p)} \le M$ and (9) is satisfied. Since
$\lambda_n(2) \ge T_3\sqrt{\log p/n}$ and using Markov inequality, we get



#{£ = 1,...,p, |at|>\a(2)/2}<#il = 1,...,p, |ad>^/ ] 



T3 /logp 



2 V n 



This proves (10) with S = M q {^f\f^^\ ■ When q = 1, assuming that the 
coherence T n satisfies T n < ctlogp/n.) 1 / 2 , observe that 

1/2 /,„„„ x 1/4 / 2 \ 1/2 / 2 \ 1/2 



' Slogp _ /2MA / logp 
riT n ~ V T 3 J \TLxiJ ~ \cl 3 Mj ""-VcT 3 M 
and thus (11) is verified for $c_1 = (cMT_3/2)^{1/2}$. When $q \in (0,1)$, using again
Markov inequality, we get

£ = #{£ = 1,...,p, |a £ |>|a w |}<M" |a w r q } 
leading to the bound |a ({) | < M£ _1 /q . Recall that Nx n = v. Thus, for q e (0, 1 ) 

£>N £>N 

Mv 1-Vq T Vq-1 < Mv 1 - 1 ^ 



Notice that 



nr n /H \ /Slogp 



Slogp / V TLT 



|!1, M -, (T3/2) ,_J_^-.(!M£)" 2 



is bounded by a constant when logp/n < c'. This implies (11). Now, we get 
«=1 



n 



J ( „, |T3/2r , ) (!2£l)- ,/2 ) 52iZ< ci s!2|£ 



which proves (12) with c\ = (2T 4 ) 2 - q (T 3 /2) q . This ends the proof of Theorem 
1 when q <E (0, 1]. We finish with the case where a belongs to Bo(S, M) which is 
very simple since we have (9) and (10) for free. (11) is obviously true with C] = 
and (12) is true with C2 = 4T 4 because there are only S non zero coefficients and 
thus 

P c 1 

V \ou\ 2 I{\a t \ < 2A n (1)} < S(2A n (l)) 2 < 4T 4 ^^. 

h 

8. Appendix 

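Recall that $\bar\alpha(\mathcal I)$ is the vector of $\mathbb R^{\#(\mathcal I)}$ such that $\Phi_{|\mathcal I}\bar\alpha(\mathcal I) := P_{V_{\mathcal I}}[\Phi\alpha]$. As soon as $\#(\mathcal I) \le N$,

$$\bar\alpha(\mathcal I) = (\Phi_{|\mathcal I}^t\Phi_{|\mathcal I})^{-1}\Phi_{|\mathcal I}^t\Phi\alpha.$$

As well, $\hat\alpha(\mathcal I)$ has been defined by $\Phi_{|\mathcal I}\hat\alpha(\mathcal I) := P_{V_{\mathcal I}}(Y)$. Using the setting (1), we get

$$\Phi_{|\mathcal I}\hat\alpha(\mathcal I) = P_{V_{\mathcal I}}[\Phi\alpha + u + \varepsilon] = \Phi_{|\mathcal I}\bar\alpha(\mathcal I) + P_{V_{\mathcal I}}[u + \varepsilon]. \qquad (32)$$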
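Identity (32) is a statement about orthogonal projections and can be verified on simulated data. The sketch below is only an illustration: the dimensions, the index set `I`, and the scales chosen for `u` and `eps` are assumptions made for the example and are not the settings of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
Phi = rng.standard_normal((n, p))
alpha = np.zeros(p)
alpha[:5] = rng.standard_normal(5)
u = 0.01 * rng.standard_normal(n)       # stand-in for the error term u
eps = 0.1 * rng.standard_normal(n)      # stand-in for the noise
Y = Phi @ alpha + u + eps

I = np.array([0, 1, 2, 7, 9])           # hypothetical index set
Phi_I = Phi[:, I]

def project(A, x):
    """Orthogonal projection of x onto the column span of A."""
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ coef

gram = Phi_I.T @ Phi_I
alpha_bar = np.linalg.solve(gram, Phi_I.T @ (Phi @ alpha))   # projection of Phi alpha
alpha_hat = np.linalg.solve(gram, Phi_I.T @ Y)               # least squares on I

lhs = Phi_I @ alpha_hat
rhs = Phi_I @ alpha_bar + project(Phi_I, u + eps)
identity_32_holds = bool(np.allclose(lhs, rhs))
print(identity_32_holds)
```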
8.1. Proof of Lemma 2

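Recall that the Gram matrix is defined by $M(\mathcal I) = n^{-1}\Phi_{|\mathcal I}^t\Phi_{|\mathcal I}$. Let $x \in \mathbb R^n$. Since

$$P_{V_{\mathcal I}}x = \Phi_{|\mathcal I}(\Phi_{|\mathcal I}^t\Phi_{|\mathcal I})^{-1}\Phi_{|\mathcal I}^t x,$$

we obtain

$$\|P_{V_{\mathcal I}}x\|^2_{l_2(n)} = (\Phi_{|\mathcal I}^t x)^t\,(nM(\mathcal I))^{-1}\,(\Phi_{|\mathcal I}^t x).$$

Applying the RIP Property (4) and observing that

$$\|\Phi_{|\mathcal I}^t x\|^2_{l_2(\#(\mathcal I))} = \sum_{\ell\in\mathcal I}\Big(\sum_{i=1}^{n} x_i\Phi_{i\ell}\Big)^2,$$

we obtain the announced result.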
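The quadratic-form expression of the projection norm can be checked numerically. In the sketch below, the smallest eigenvalue of the Gram matrix plays the role of the lower constant in the RIP Property (4); this is an assumption made for illustration, since (4) is not restated here, and the sizes `n`, `k` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 6
Phi_I = rng.standard_normal((n, k))
# Normalise columns so that (1/n) * sum_i Phi_il^2 = 1.
Phi_I /= np.sqrt((Phi_I ** 2).mean(axis=0))
x = rng.standard_normal(n)

M_I = Phi_I.T @ Phi_I / n                       # Gram matrix M(I)
v = Phi_I.T @ x                                 # inner products with the columns

# Squared norm of the projection, computed directly.
proj = Phi_I @ np.linalg.solve(Phi_I.T @ Phi_I, v)
proj_sq_norm = float(proj @ proj)

# Same quantity as a quadratic form in (n M(I))^{-1}.
quad_form = float(v @ np.linalg.solve(n * M_I, v))

# Upper bound obtained from the smallest eigenvalue of M(I).
lam_min = float(np.linalg.eigvalsh(M_I).min())
upper = float(v @ v) / (n * lam_min)

print(proj_sq_norm, quad_form, upper)
```

The first two printed values agree up to rounding, and the third dominates them.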




8.2. Proof of Proposition 1 

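We have

$$\sum_{\ell\in\mathcal I}(\alpha_\ell - \hat\alpha_\ell(\mathcal B))^2 = \|\alpha_{|\mathcal I} - \hat\alpha(\mathcal B)_{|\mathcal I}\|^2_{l_2(\#(\mathcal I))}$$

$$\le 3\Big(\|\alpha_{|\mathcal I} - \bar\alpha(\mathcal I)\|^2_{l_2(\#(\mathcal I))} + \|\bar\alpha(\mathcal I) - \hat\alpha(\mathcal I)\|^2_{l_2(\#(\mathcal I))} + \|\hat\alpha(\mathcal I) - \hat\alpha(\mathcal B)_{|\mathcal I}\|^2_{l_2(\#(\mathcal I))}\Big) := 3\,(t_1(\mathcal I) + t_2(\mathcal I) + t_3).$$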



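Since

$$\Phi_{|\mathcal I}\bar\alpha(\mathcal I) = P_{V_{\mathcal I}}[\Phi\alpha] = \Phi_{|\mathcal I}\alpha_{|\mathcal I} + P_{V_{\mathcal I}}[\Phi_{|\mathcal I^c}\alpha_{|\mathcal I^c}],$$

we get, using twice the RIP Property,

$$t_1(\mathcal I) \le \frac{1}{1-\nu}\,\|P_{V_{\mathcal I}}[\Phi_{|\mathcal I^c}\alpha_{|\mathcal I^c}]\|_n^2 \le \frac{1}{n^2(1-\nu)^2}\sum_{\ell\in\mathcal I}\Big(\sum_{\ell'\in\mathcal I^c}\alpha_{\ell'}\sum_{i=1}^{n}\Phi_{i\ell}\Phi_{i\ell'}\Big)^2 \le \frac{1}{(1-\nu)^2}\,\#(\mathcal I)\,\tau_n^2\,\|\alpha\|^2_{l_1(p)}.$$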
Using Lemma 2 and Equality (32), we get

$$t_2(\mathcal I) \le \frac{1}{1-\nu}\,\|\Phi_{|\mathcal I}\hat\alpha(\mathcal I) - \Phi_{|\mathcal I}\bar\alpha(\mathcal I)\|_n^2 = \frac{1}{1-\nu}\,\|P_{V_{\mathcal I}}[\varepsilon + u]\|_n^2.$$



By Assumption (13) on the errors u, we deduce 

1 /„„ r ,, 2 , yS 



t 2 {x)<— \ \\?vM\\t + 4- 

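Now, since $\mathcal I \subset \mathcal B$, we obtain

$$\Phi_{|\mathcal I}\hat\alpha(\mathcal I) - \Phi_{|\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal I} = P_{V_{\mathcal I}}\big[\Phi_{|\mathcal I}\hat\alpha(\mathcal I) - \Phi_{|\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[\Phi_{|\mathcal I}\hat\alpha(\mathcal I) - \Phi_{|\mathcal B}\hat\alpha(\mathcal B) + \Phi_{|\mathcal B\setminus\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal B\setminus\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[P_{V_{\mathcal I}}[\Phi\alpha + u + \varepsilon] - P_{V_{\mathcal B}}[\Phi\alpha + u + \varepsilon] + \Phi_{|\mathcal B\setminus\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal B\setminus\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[\Phi_{|\mathcal B\setminus\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal B\setminus\mathcal I}\big].$$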
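The chain of equalities above relies on the fact that projecting onto the larger span and then onto the smaller one is the same as projecting directly onto the smaller one. A minimal numerical sketch follows, assuming a hypothetical leader set `B` of ten columns and a hypothetical subset `I` of three of them; the response `Y` is simply random, since the identity holds for any vector.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 30
Phi = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

B = np.arange(10)                        # hypothetical set of leaders
I_pos = np.array([0, 2, 5])              # positions inside B of the subset I
rest_pos = np.setdiff1d(np.arange(B.size), I_pos)
I = B[I_pos]
B_minus_I = B[rest_pos]

def lstsq_coef(A, y):
    """Least-squares coefficients of y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

def project(A, y):
    """Orthogonal projection of y onto the column span of A."""
    return A @ lstsq_coef(A, y)

a_hat_B = lstsq_coef(Phi[:, B], Y)       # regression on all leaders
a_hat_I = lstsq_coef(Phi[:, I], Y)       # regression on the subset only

# Left-hand side: difference of the two fits restricted to the columns in I.
lhs = Phi[:, I] @ a_hat_I - Phi[:, I] @ a_hat_B[I_pos]
# Right-hand side: projection onto V_I of the part of the B-fit carried by B \ I.
rhs = project(Phi[:, I], Phi[:, B_minus_I] @ a_hat_B[rest_pos])

nested_identity_holds = bool(np.allclose(lhs, rhs))
print(nested_identity_holds)
```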

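Combining with Lemma 1 and Lemma 2, it leads to

$$t_3 \le \frac{1}{1-\nu}\,\|\Phi_{|\mathcal I}\hat\alpha(\mathcal I) - \Phi_{|\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal I}\|_n^2 = \frac{1}{1-\nu}\,\|P_{V_{\mathcal I}}[\Phi_{|\mathcal B\setminus\mathcal I}\hat\alpha(\mathcal B)_{|\mathcal B\setminus\mathcal I}]\|_n^2$$

$$\le \frac{1}{(1-\nu)^2}\,\tau_n^2\,\#(\mathcal I)\,\|\hat\alpha(\mathcal B)\|^2_{l_1(\#(\mathcal B))} \le \frac{1}{(1-\nu)^2}\,\tau_n^2\,\#(\mathcal I)\,\#(\mathcal B)\,\|\hat\alpha(\mathcal B)\|^2_{l_2(\#(\mathcal B))}$$

$$\le \frac{3}{(1-\nu)^2}\,\tau_n^2\,\#(\mathcal I)\,\#(\mathcal B)\,\Big(\|\hat\alpha(\mathcal B) - \bar\alpha(\mathcal B)\|^2_{l_2(\#(\mathcal B))} + \|\bar\alpha(\mathcal B) - \alpha_{|\mathcal B}\|^2_{l_2(\#(\mathcal B))} + \|\alpha_{|\mathcal B}\|^2_{l_2(\#(\mathcal B))}\Big).$$

Now, since $\#(\mathcal B) \le N$ and $N = \nu/\tau_n$, we obtain

$$t_3 \le \frac{3\nu}{(1-\nu)^2}\,\tau_n\,\#(\mathcal I)\,\Big(t_1(\mathcal B) + t_2(\mathcal B) + \|\alpha\|^2_{l_2(p)}\Big)$$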

< ^ #LT) x 2 n (1±^N<£ ||a|fr (pl + (\\Pv B mi + c^) + N 
A_ #mT ,„^ ^+v), ,^ , 4vc 2 _ S 

This ends the proof. 



2 



< _ < [p) I -^Tn + 11+ IT3 ^ I Tn- + TO <||Pv s [e]|K 
8.3. Proof of Proposition 2

First, we prove the part concerning the non random set $\mathcal I$. The following lemma gives the concentration inequality when the errors $\varepsilon_i$ are Gaussian. Note that a corresponding inequality stating concentration for projections of subgaussian variables can be found in Proposition 5.1 of Huang et al. (2009) (with possibly not the optimal constants, as stated by the authors).

Lemma 3. Let $k$ be a positive integer and $U$ be a $\chi^2_k$ variable. Then

$$\forall u^2 \ge \frac{4k}{n},\qquad P\Big(\frac{1}{n}U \ge u^2\Big) \le \exp(-nu^2/8).$$

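Recall the following result by Massart (2007). If $X_t$ is a centered Gaussian process such that $\sigma^2 := \sup_t E X_t^2$, then

$$\forall y > 0,\qquad P\Big(\sup_t X_t - E\sup_t X_t \ge y\Big) \le \exp\Big(-\frac{y^2}{2\sigma^2}\Big).$$

Let $Z_1,\dots,Z_k$ be i.i.d. standard Gaussian variables such that

$$P(U > nu^2) = P\Big(\sum_{i=1}^{k} Z_i^2 > nu^2\Big) = P\Big(\sup_{a\in S_1}\sum_{i=1}^{k} a_iZ_i > (nu^2)^{1/2}\Big)$$

$$= P\Big(\sup_{a\in S_1}\sum_{i=1}^{k} a_iZ_i - E\sup_{a\in S_1}\sum_{i=1}^{k} a_iZ_i > (nu^2)^{1/2} - E\sup_{a\in S_1}\sum_{i=1}^{k} a_iZ_i\Big) \qquad (33)$$

where $S_1 = \{a \in \mathbb R^k,\ \|a\|_{l_2(k)} = 1\}$. Denote

$$X_a = \sum_{i=1}^{k} a_iZ_i \quad\text{and}\quad y = (nu^2)^{1/2} - E\sup_{a\in S_1}\sum_{i=1}^{k} a_iZ_i.$$

Notice that

$$a \in S_1 \ \Rightarrow\ E(X_a)^2 = 1$$

as well as

$$E\sup_{a\in S_1}X_a = E\Big[\Big(\sum_{i=1}^{k} Z_i^2\Big)^{1/2}\Big] \le \Big(E\sum_{i=1}^{k} Z_i^2\Big)^{1/2} = k^{1/2}.$$

Since $u^2 \ge 4k/n$, we have $k^{1/2} \le (nu^2)^{1/2}/2$ and hence $y \ge (nu^2)^{1/2}/2$. The announced result then follows from Massart's inequality applied with $\sigma^2 = 1$.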
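The tail bound of Lemma 3 and the Jensen step used in its proof can be illustrated by simulation. The following is a rough Monte Carlo sketch; the values of `n`, `k`, the number of draws and the grid of levels `u2` are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 5
n_draws = 200_000
U = rng.chisquare(df=k, size=n_draws)    # draws of a chi-square variable with k degrees of freedom

# Compare the empirical tail P(U/n >= u^2) with exp(-n u^2 / 8) for u^2 >= 4k/n.
results = []
for u2 in (4 * k / n, 6 * k / n, 8 * k / n):
    empirical = float(np.mean(U / n >= u2))
    bound = float(np.exp(-n * u2 / 8))
    results.append((u2, empirical, bound))

# Jensen step of the proof: E[(sum Z_i^2)^(1/2)] <= k^(1/2).
mean_norm = float(np.mean(np.sqrt(U)))

for u2, empirical, bound in results:
    print(f"u^2 = {u2:.2f}: empirical tail = {empirical:.5f}, bound = {bound:.5f}")
print(f"mean Euclidean norm = {mean_norm:.3f}, sqrt(k) = {np.sqrt(k):.3f}")
```

In this regime the empirical tail is much smaller than the bound, which is expected since the constants in the lemma are not claimed to be sharp.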

Assume now that $\mathcal I$ is random. Taking into account all the non random possibilities $\mathcal I'$ for the set $\mathcal I$ and applying Proposition 2 in the non random case, we get

$$P\Big(\frac{1}{n}\|P_{V_{\mathcal I}}[\varepsilon]\|^2_{l_2(n)} \ge \mu^2\Big) \le \exp(-n\mu^2/16)$$

as soon as $\mu^2 \ge 16\,n_{\mathcal I}\log p/n$.
Fig. 1. Y-axis: coherence $\tau_n$. X-axis: $\sqrt{n}$ (left) or $\sqrt{\log p/n}$ (right) for p = 100 (dashdot line or triangle, green), p = 1000 (solid line or square, red), p = 10000 (dash line or circle, blue). K = 500.



Table 1. Prediction error for varying sparsities S and different distributions for the regressors, n = 250, p = 1000, SNR = 5.

S     G             U             B             T(5)          T(4)          T(2)          T(1)
5     0.00 (0.0)    0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.01)
10    0.00 (0.01)   0.00 (0.02)   0.00 (0.01)   0.00 (0.00)   0.00 (0.01)   0.00 (0.00)   0.00 (0.05)
15    0.01 (0.02)   0.02 (0.03)   0.02 (0.02)   0.03 (0.03)   0.01 (0.02)   0.01 (0.02)   0.01 (0.07)
20    0.04 (0.03)   0.03 (0.03)   0.03 (0.03)   0.05 (0.04)   0.03 (0.03)   0.03 (0.03)   0.04 (0.12)
25    0.07 (0.04)   0.07 (0.05)   0.06 (0.04)   0.06 (0.03)   0.07 (0.04)   0.07 (0.04)   0.08 (0.14)
30    0.10 (0.06)   0.11 (0.06)   0.08 (0.03)   0.08 (0.04)   0.11 (0.05)   0.10 (0.05)   0.17 (0.24)
35    0.15 (0.06)   0.14 (0.07)   0.15 (0.08)   0.14 (0.06)   0.13 (0.06)   0.16 (0.07)   0.25 (0.26)
40    0.19 (0.07)   0.17 (0.06)   0.17 (0.07)   0.17 (0.06)   0.18 (0.07)   0.21 (0.09)   0.35 (0.27)


Table 2. Prediction errors for varying sparsity S and varying SNR computed using LOL, SIS-Reg and SIS-Lasso procedures, n = 200, p = 1000, K = 100.

SNR   method      S = 10          S = 20          S = 30          S = 50          S = 60
10    LOL         0.146 (0.141)   0.273 (0.110)   0.381 (0.068)   0.491 (0.118)   0.462 (0.108)
10    SIS-Lasso   0.161 (0.103)   0.389 (0.035)   0.477 (0.030)   0.543 (0.029)   0.554 (0.028)
10    Lasso-Reg   0.096 (0.005)   0.095 (0.005)   0.165 (0.102)   0.486 (0.121)   0.472 (0.101)
5     LOL         0.228 (0.073)   0.351 (0.077)   0.436 (0.123)   0.478 (0.093)   0.496 (0.067)
5     SIS-Lasso   0.223 (0.048)   0.388 (0.053)   0.476 (0.029)   0.543 (0.030)   0.562 (0.032)
5     Lasso-Reg   0.188 (0.011)   0.192 (0.016)   0.323 (0.090)   0.466 (0.095)   0.523 (0.124)
2     LOL         0.388 (0.071)   0.463 (0.084)   0.472 (0.060)   0.560 (0.150)   0.545 (0.104)
2     SIS-Lasso   0.418 (0.035)   0.509 (0.026)   0.541 (0.031)   0.589 (0.033)   0.613 (0.032)
2     Lasso-Reg   0.459 (0.052)   0.514 (0.069)   0.523 (0.065)   0.581 (0.153)   0.597 (0.112)
Fig. 2. X-axis: indeterminacy level $\delta$, Y-axis: relative prediction error. S = 10 (solid line, red); S = 12 (dashdot line, blue); S = 15 (dash line, green); S = 20 (dot line, black). SNR = 5.



Table 3. Prediction errors for varying ultra high dimension p and sparsity S computed using LOL. SNR = 5, K = 100.

p       n     S = 5           S = 10          S = 20          S = 40          S = 60
5000    400   0.195 (0.007)   0.194 (0.006)   0.236 (0.051)   0.426 (0.058)   0.497 (0.065)
5000    800   0.195 (0.004)   0.195 (0.005)   0.196 (0.012)   0.234 (0.036)   0.340 (0.046)
10000   400   0.195 (0.008)   0.193 (0.007)   0.244 (0.064)   0.420 (0.052)   0.443 (0.068)
10000   800   0.196 (0.004)   0.195 (0.005)   0.193 (0.005)   0.236 (0.043)   0.348 (0.050)
20000   400   0.204 (0.063)   0.201 (0.049)   0.277 (0.088)   0.408 (0.074)   0.401 (0.074)
20000   800   0.193 (0.004)   0.195 (0.005)   0.194 (0.004)   0.242 (0.036)   0.395 (0.055)


Fig. 3. X-axis: sparsity rate $\rho$, Y-axis: relative prediction error. $\delta$ = 0.4 (dot line, black); $\delta$ = 0.7 (dashdot line, blue); $\delta$ = 0.75 (solid line, red); $\delta$ = 0.875 (dashed line, green). SNR = 5.
Fig. 4. LOL sparsity estimation ($\rho$: bottom, left; S: right, top). $\delta$ = 0.875 (dashed line, green); $\delta$ = 0.75 (solid line, red); $\delta$ = 0.7 (dashdot line, blue); $\delta$ = 0.4 (dot line, black).




Fig. 5. X-axis: sparsity rate $\rho$. Y-axis: relative prediction errors for LOL (dot lines) and LOL+ (solid lines). $\delta$ = 0.4 (blue); $\delta$ = 0.75 (red); $\delta$ = 0.875 (green). The regressors are Gaussian of size n = 250. SNR = 5.
Fig. 6. X-axis: sparsity S. Y-axis: relative prediction errors for LOL with independent regressors (solid line, red) and dependent regressors (5% of dependency, dashdot line, blue; 20%, dashed line, blue). p = 1000, n = 250, K = 100. SNR = 5.
Fig. 7. Empirical densities of the coherence $\tau_n$. The regressors are Gaussian (solid line, red); uniform (solid line, blue); Bernoulli (solid line, green); Student 5, 4, 3, 2, 1 (black lines, from left to right). n = 250, p = 1000.
Fig. 8. LOL sparsity estimation for different distributions for the predictors. Gauss (solid line, red); Uniform (solid line, blue); Bernoulli (solid line, green); T(2-5) (black lines); T(1) (dot black line). n = 250, p = 1000, K = 200.







Fig. 9. X-axis: sparsity S. Y-axis: coherence $\tau_n$ computed for the N selected leaders. Gauss (solid line, red); Uniform (solid line, blue); Bernoulli (solid line, green); T(1) (dot line, black). n = 250, p = 1000, K = 200.